Multiple linear regression in Python

I can't seem to find any Python libraries that do multiple regression. The only things I find are simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(Output for the above:)

      y        x1       x2       x3        x4     x5     x6       x7
-6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
-5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
-10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
-5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
-8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
-3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
-6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
-8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
-8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c


sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also has similar interfaces to run various kinds of regularized regressions.
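For instance, ridge regression (ordinary least squares with an L2 penalty) exposes the same fit/coef_ interface. A minimal sketch, assuming the same texts objects as in the question; the alpha value is only illustrative:

from sklearn import linear_model

# Ridge regression: least squares with an L2 penalty on the coefficients.
# alpha=1.0 is an arbitrary illustrative value; tune it for your data.
ridge = linear_model.Ridge(alpha=1.0)
ridge.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
          [t.y for t in texts])

print(ridge.coef_)       # coefficients a1..a7
print(ridge.intercept_)  # constant term c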

You can use numpy.linalg.lstsq.

Here is a little workaround that I created. I checked it with R and it works correctly.

import numpy as np
import statsmodels.api as sm


y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]


x = [
    [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
    [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
    [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]


def reg_m(y, x):
    # Note: each new predictor is stacked in front of the existing design matrix,
    # so the columns (and the x1, x2, ... labels in the summary) end up in the
    # reverse order of the input list; the column of ones is the constant term.
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895


==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS, as described in this answer:

Run an OLS regression with a Pandas DataFrame

You can use numpy.linalg.lstsq:

import numpy as np


y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
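As a quick sanity check on the fit, you can also compute the residuals and an R² by hand; a minimal sketch reusing X, y and beta_hat from above:

residuals = y - X.dot(beta_hat)           # observed minus fitted values
ss_res = (residuals ** 2).sum()           # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()      # total sum of squares
print("R^2 =", 1 - ss_res / ss_tot)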

Use scipy.optimize.curve_fit. It works for more than just linear fits.

import numpy as np
from scipy.optimize import curve_fit


def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]


# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4


x = np.array([[0,1,2,0,1,2,0,1,2], [0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
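Because curve_fit accepts an arbitrary model function, the same pattern works for non-linear models too. A minimal sketch with a made-up exponential model and synthetic data (the function name and values are only for illustration):

import numpy as np
from scipy.optimize import curve_fit


def fn_exp(x, a, b, c):
    # hypothetical non-linear model: a + b * exp(c * x0) + x1
    return a + b * np.exp(c * x[0]) + x[1]


x = np.array([[0.0, 0.5, 1.0, 1.5, 2.0, 2.5], [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]])
y = fn_exp(x, 1.0, 2.0, -0.5)              # synthetic observations with known parameters
popt, pcov = curve_fit(fn_exp, x, y, p0=[1, 1, -1])
print(popt)                                # should recover roughly [1.0, 2.0, -0.5]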

You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
        if None, then use the last column of x as y
    @param show: if True, print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm

    # last column is the response, the rest (plus a constant) are the predictors
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res

Once you convert your data to a pandas DataFrame (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.
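A minimal follow-up sketch: the fitted model can also produce predictions, where new_df stands for a hypothetical DataFrame with the same x1 .. x7 columns:

# In-sample fitted values
print(lm.fittedvalues)

# Predictions for new observations (new_df is a hypothetical DataFrame
# containing columns x1 .. x7)
print(lm.predict(new_df))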

See this notebook for more examples.

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.

In short:

  • Multiple linear regression: the response y is a scalar.
  • Multivariate linear regression: the response y is a vector.

(Another source.)
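To illustrate the distinction, numpy.linalg.lstsq also accepts a matrix of responses, one column per response variable, which corresponds to the multivariate case. A minimal sketch with made-up data:

import numpy as np

# 5 observations, 2 predictors, plus a column of ones for the intercept
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
X = np.c_[X, np.ones(X.shape[0])]

# Multivariate response: Y has two columns, so lstsq fits one set of
# coefficients per column in a single call
Y = np.array([[1.0, 10.0], [2.0, 9.0], [3.0, 8.0], [4.0, 7.0], [5.0, 6.0]])

B, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B)   # one column of coefficients per response variable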

I think this may be the easiest way to do it:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4


print(x.head())


x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1


print(y.head())


0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64


model = OLS(y, x)
result = model.fit()
print(result.summary())


OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================

As mentioned above, you can use the sklearn library to handle multiple linear regression. I am using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)


# display coefficients
print(regressor.coef_)
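regressor.coef_ holds the coefficients a1..a7; a minimal follow-up sketch for the intercept and for predictions, where X_new stands for a hypothetical 2-D array of new observations:

# The constant term c
print(regressor.intercept_)

# Predictions for new observations (X_new is a hypothetical 2-D array
# with the same number of columns as X)
print(regressor.predict(X_new))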

Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm


y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit or sm.Probit, etc.
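For example, a minimal sketch of a logistic regression with the same patsy/statsmodels pattern; y_bin stands for a hypothetical binary (0/1) column in my_data:

from patsy import dmatrices
import statsmodels.api as sm


# y_bin is assumed to be a 0/1 outcome; Logit fits a logistic regression
y, x = dmatrices("y_bin ~ x_1 + x_2", data=my_data)
logit_results = sm.Logit(y, x).fit()
print(logit_results.summary())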

Scikit-learn is a machine learning library for Python that can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for running multiple linear regression in Python using sklearn:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd


# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]


# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])


from sklearn.preprocessing import OneHotEncoder
# Note: the categorical_features argument only exists in older scikit-learn
# releases; in recent versions use sklearn.compose.ColumnTransformer instead.
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()


# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself


#Spliting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)


# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


# Predicting the test set results
y_pred = regressor.predict(X_test)
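To get a quick measure of fit on the held-out data, one option (a minimal sketch) is the R² score from sklearn.metrics:

from sklearn.metrics import r2_score

# Compare predictions against the held-out targets
print("Test R^2:", r2_score(y_test, y_pred))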

That's it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with a worked example, visit: Linear Regression (https://medium.com/data-science-where/line-return-is-simple-81a05da5e0e1)

Finding a linear model such as this one can be handled with OpenTURNS.

In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More precisely, it builds the following linear model:

Y = a0 + a1.X1 + ... + an.Xn + ε,

where the error ε is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai:

from __future__ import print_function
import pandas as pd
import openturns as ot


# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")


# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)


# The observation points are in the first column (dimension 1)
Y = sample[:, 0]


# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]


# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()


# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())

Then you can easily get the confidence intervals with the following call:

# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)

You can find a more detailed example in the OpenTURNS examples.
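As a possible follow-up sketch, the fitted model can be evaluated at new input points, assuming the result object exposes getMetaModel() as OpenTURNS metamodel results generally do; the input values below are made up for illustration:

# Evaluate the fitted linear model at a new point (x1, ..., x7)
metamodel = result.getMetaModel()
x_new = [[-5.0, -6.0, -0.8, 15.0, 4.3, 0.18, 0.5]]
print("prediction =", metamodel(x_new))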

Try a generalized linear model with a Gaussian family:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm


y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
# Transpose so each element is one observation; because of reversed(),
# X0 ends up corresponding to the last predictor row (x7) above.
X = list(zip(*reversed(X)))


df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)


df = df.drop('X', axis=1)
print(df)




# model_formula = 'y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula = 'y ~ X0'


model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula,
                data=df,
                family=model_family).fit()


print(model_fit.summary())


# Extract coefficients from the fitted model
# print(model_fit.params)
intercept, slope = model_fit.params


# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)


# Extract and print confidence intervals
print(model_fit.conf_int())


df2 = pd.DataFrame()
df2['X0'] = np.linspace(0.50, 0.70, 50)


df3 = pd.DataFrame()
df3['X1'] = np.linspace(0.20, 0.60, 50)


prediction0 = model_fit.predict(df2)
# prediction1 = model_fit.predict(df3)


plt.plot(df2['X0'], prediction0, label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()

Linear regression is a good example to start with for artificial intelligence.

Here is a good example of a multiple linear regression machine learning algorithm using Python:

##### Predicting House Prices Using Multiple Linear Regression - @Y_T_Akademi
    

#### In this project we are going to see how machine learning algorithms help us predict house prices. Linear regression is a model for predicting new future data by using the existing correlation between the old data. Here, machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.


import pandas as pd


##### we use the sklearn library in many machine learning calculations


from sklearn import linear_model


##### we import our dataset: housepricesdataset.csv


df = pd.read_csv("housepricesdataset.csv",sep = ";")


##### The feature set (input columns) is: area, roomcount, buildingage
##### The output (result) column is: price
##### we define a linear regression model here:


reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])


# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..


reg.predict([[230,4,10]])


# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])


# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])


# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])

My dataset (housepricesdataset.csv) contains columns for area, roomcount, buildingage, and price.