Multiple linear regression in Python

I can't seem to find any Python libraries that do multiple regression. The only things I find are simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(Output for the above:)

      y        x1       x2       x3        x4     x5     x6       x7
-6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
-5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
-10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
-5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
-8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
-3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
-6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
-8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
-8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c


sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also has similar interfaces to run various kinds of regularized regressions.
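For instance, ridge regression (ordinary least squares with an L2 penalty) exposes the same fit/coef_ interface. A minimal sketch, assuming the same texts objects as in the question; the alpha value is only illustrative:

from sklearn import linear_model

# Ridge regression: least squares with an L2 penalty on the coefficients.
# alpha=1.0 is an arbitrary illustrative value; tune it for your data.
ridge = linear_model.Ridge(alpha=1.0)
ridge.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
          [t.y for t in texts])

print(ridge.coef_)       # coefficients a1..a7
print(ridge.intercept_)  # constant term c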

You can use numpy.linalg.lstsq.

Here is a little workaround that I created. I checked it with R and it works correctly.

import numpy as np
import statsmodels.api as sm


y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]


x = [
    [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
    [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
    [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]


def reg_m(y, x):
    # Note: each new predictor is stacked in front of the existing design matrix,
    # so the columns (and the x1, x2, ... labels in the summary) end up in the
    # reverse order of the input list; the column of ones is the constant term.
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895


==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS, as described in this answer:

Run an OLS regression with a Pandas DataFrame

You can use numpy.linalg.lstsq:

import numpy as np


y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
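As a quick sanity check on the fit, you can also compute the residuals and an R² by hand; a minimal sketch reusing X, y and beta_hat from above:

residuals = y - X.dot(beta_hat)           # observed minus fitted values
ss_res = (residuals ** 2).sum()           # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()      # total sum of squares
print("R^2 =", 1 - ss_res / ss_tot)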

Use scipy.optimize.curve_fit. It works for more than just linear fits.

import numpy as np
from scipy.optimize import curve_fit


def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]


# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4


x = np.array([[0,1,2,0,1,2,0,1,2], [0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
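Because curve_fit accepts an arbitrary model function, the same pattern works for non-linear models too. A minimal sketch with a made-up exponential model and synthetic data (the function name and values are only for illustration):

import numpy as np
from scipy.optimize import curve_fit


def fn_exp(x, a, b, c):
    # hypothetical non-linear model: a + b * exp(c * x0) + x1
    return a + b * np.exp(c * x[0]) + x[1]


x = np.array([[0.0, 0.5, 1.0, 1.5, 2.0, 2.5], [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]])
y = fn_exp(x, 1.0, 2.0, -0.5)              # synthetic observations with known parameters
popt, pcov = curve_fit(fn_exp, x, y, p0=[1, 1, -1])
print(popt)                                # should recover roughly [1.0, 2.0, -0.5]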

You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
        if None, then use the last column of x as y
    @param show: if True, print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm

    # last column is the response, the rest (plus a constant) are the predictors
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res

Once you convert your data to a pandas DataFrame (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.
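A minimal follow-up sketch: the fitted model can also produce predictions, where new_df stands for a hypothetical DataFrame with the same x1 .. x7 columns:

# In-sample fitted values
print(lm.fittedvalues)

# Predictions for new observations (new_df is a hypothetical DataFrame
# containing columns x1 .. x7)
print(lm.predict(new_df))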

See this notebook for more examples.

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.

In short:

  • Multiple linear regression: the response y is a scalar.
  • Multivariate linear regression: the response y is a vector.

(Another source.)
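To illustrate the distinction, numpy.linalg.lstsq also accepts a matrix of responses, one column per response variable, which corresponds to the multivariate case. A minimal sketch with made-up data:

import numpy as np

# 5 observations, 2 predictors, plus a column of ones for the intercept
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
X = np.c_[X, np.ones(X.shape[0])]

# Multivariate response: Y has two columns, so lstsq fits one set of
# coefficients per column in a single call
Y = np.array([[1.0, 10.0], [2.0, 9.0], [3.0, 8.0], [4.0, 7.0], [5.0, 6.0]])

B, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B)   # one column of coefficients per response variable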

I think this may be the easiest way to do it:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4


print(x.head())


x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1


print(y.head())


0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64


model = OLS(y, x)
result = model.fit()
print(result.summary())


OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================

As mentioned above, you can use the sklearn library to handle multiple linear regression. I am using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)


# display coefficients
print(regressor.coef_)
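regressor.coef_ holds the coefficients a1..a7; a minimal follow-up sketch for the intercept and for predictions, where X_new stands for a hypothetical 2-D array of new observations:

# The constant term c
print(regressor.intercept_)

# Predictions for new observations (X_new is a hypothetical 2-D array
# with the same number of columns as X)
print(regressor.predict(X_new))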

Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm


y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit or sm.Probit, etc.
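For example, a minimal sketch of a logistic regression with the same patsy/statsmodels pattern; y_bin stands for a hypothetical binary (0/1) column in my_data:

from patsy import dmatrices
import statsmodels.api as sm


# y_bin is assumed to be a 0/1 outcome; Logit fits a logistic regression
y, x = dmatrices("y_bin ~ x_1 + x_2", data=my_data)
logit_results = sm.Logit(y, x).fit()
print(logit_results.summary())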

Scikit-learn is a machine learning library for Python that can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for running multiple linear regression in Python using sklearn:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd


# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]


# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])


from sklearn.preprocessing import OneHotEncoder
# Note: the categorical_features argument only exists in older scikit-learn
# releases; in recent versions use sklearn.compose.ColumnTransformer instead.
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()


# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself


#Spliting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)


# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


# Predicting the test set results
y_pred = regressor.predict(X_test)
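To get a quick measure of fit on the held-out data, one option (a minimal sketch) is the R² score from sklearn.metrics:

from sklearn.metrics import r2_score

# Compare predictions against the held-out targets
print("Test R^2:", r2_score(y_test, y_pred))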

That's it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with a worked example, visit: Linear Regression (https://medium.com/data-science-where/line-return-is-simple-81a05da5e0e1)

Finding a linear model such as this one can be handled with OpenTURNS.

In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More precisely, it builds the following linear model:

Y = a0 + a1.X1 + ... + an.Xn + ε,

where the error ε is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai:

from __future__ import print_function
import pandas as pd
import openturns as ot


# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep="\s+")


# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)


# The observation points are in the first column (dimension 1)
Y = sample[:, 0]


# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]


# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()


# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())

Then you can easily get the confidence intervals with the following call:

# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)

You can find a more detailed example in the OpenTURNS examples.
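As a possible follow-up sketch, the fitted model can be evaluated at new input points, assuming the result object exposes getMetaModel() as OpenTURNS metamodel results generally do; the input values below are made up for illustration:

# Evaluate the fitted linear model at a new point (x1, ..., x7)
metamodel = result.getMetaModel()
x_new = [[-5.0, -6.0, -0.8, 15.0, 4.3, 0.18, 0.5]]
print("prediction =", metamodel(x_new))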

Try a generalized linear model with a Gaussian family:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm


y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
# Transpose so each element is one observation; because of reversed(),
# X0 ends up corresponding to the last predictor row (x7) above.
X = list(zip(*reversed(X)))


df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)


df = df.drop('X', axis=1)
print(df)




# model_formula = 'y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula = 'y ~ X0'


model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula,
                data=df,
                family=model_family).fit()


print(model_fit.summary())


# Extract coefficients from the fitted model
# print(model_fit.params)
intercept, slope = model_fit.params


# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)


# Extract and print confidence intervals
print(model_fit.conf_int())


df2 = pd.DataFrame()
df2['X0'] = np.linspace(0.50, 0.70, 50)


df3 = pd.DataFrame()
df3['X1'] = np.linspace(0.20, 0.60, 50)


prediction0 = model_fit.predict(df2)
# prediction1 = model_fit.predict(df3)


plt.plot(df2['X0'], prediction0, label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()

Linear regression is a good example to start with for artificial intelligence.

Here is a good example of a multiple linear regression machine learning algorithm using Python:

##### Predicting House Prices Using Multiple Linear Regression - @Y_T_Akademi
    

#### In this project we are going to see how machine learning algorithms help us predict house prices. Linear regression is a model for predicting new future data by using the existing correlation between the old data. Here, machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.


import pandas as pd


##### we use the sklearn library in many machine learning calculations


from sklearn import linear_model


##### we import our dataset: housepricesdataset.csv


df = pd.read_csv("housepricesdataset.csv",sep = ";")


##### The feature set (input columns) is: area, roomcount, buildingage
##### The output (result) column is: price
##### we define a linear regression model here:


reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])


# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..


reg.predict([[230,4,10]])


# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])


# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])


# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])

My dataset (housepricesdataset.csv) contains columns for area, roomcount, buildingage, and price.