Run an OLS regression with a pandas DataFrame

I have a pandas DataFrame and I want to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. That would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas DataFrame?


I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================


Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
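As a side note, the fitted result object can also produce predictions from any DataFrame that contains B and C columns; a minimal sketch, reusing the result and df objects fitted above:

result.predict(df)                                     # fitted values for the original rows
result.predict(pd.DataFrame({"B": [25], "C": [100]}))  # prediction for new data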

Note: pandas.stats has been removed with 0.20.0


It was possible to do this with pandas.stats.ols:

>>> import pandas as pd
>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------


Formula: Y ~ <B> + <C> + <intercept>


Number of Observations:         5
Number of Degrees of Freedom:   3


R-squared:         0.5789
Adj R-squared:     0.1577


Rmse:             14.5108


F-stat (2, 2):     1.3746, p-value:     0.4211


Degrees of Freedom: model 2, resid 2


-----------------------Summary of Estimated Coefficients------------------------
Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
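Since pandas.stats is gone from modern pandas, a rough equivalent of the call above, sketched directly with statsmodels (which this answer already requires):

import statsmodels.api as sm
res = sm.OLS(df['A'], sm.add_constant(df[['B', 'C']])).fit()
print(res.params)   # const (intercept), B and C, matching the coefficients in the table above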

This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

No, it doesn't. Just convert it to a NumPy array:

>>> import numpy as np
>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
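A short follow-up sketch, reusing the X, y and lr objects just defined, for getting predictions and the in-sample fit back out:

lr.predict(X)    # predictions for the training rows
lr.score(X, y)   # in-sample R^2; should agree with the ~0.579 reported by statsmodels above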

I don't know if this is new in sklearn or pandas, but I was able to pass the DataFrame directly to sklearn without converting it to a NumPy array or any other data type.

from sklearn import linear_model


reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])


>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
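Prediction works the same way; a small sketch (new_rows here is just a made-up example frame) showing that a DataFrame can be passed straight to predict as well:

new_rows = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})
reg.predict(new_rows)   # uses the B and C columns, no conversion needed
reg.intercept_          # the fitted intercept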

statsmodels can build an OLS model with column references pointing directly at a pandas DataFrame.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np


# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))


# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y ]


# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()


# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()


model.summary()

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

To get R-squared, coefficients and p-values directly:

# commands:
model.params
model.pvalues
model.rsquared


# demo:
In[1]:
model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64


In[2]:
model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64


In[3]:
model.rsquared
Out[3]:
0.0190
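In the same spirit, confidence intervals and fitted values are also exposed as attributes (a brief sketch, reusing the model fitted above):

model.conf_int()     # 95% confidence intervals for const, B and C
model.fittedvalues   # in-sample predictions, indexed like df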

B is not statistically significant; the data is not capable of drawing inferences from it. C does influence the probability of B:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf


df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})


# Cap the outlier in C: values above the mean of C are replaced by the
# mean of the below-average C values
avg_c = df['C'].mean()
sumC = df['C'].apply(lambda x: x if x < avg_c else 0).sum()
countC = df['C'].apply(lambda x: 1 if x < avg_c else None).count()
avg_c2 = sumC / countC
df['C'] = df['C'].apply(lambda x: avg_c2 if x > avg_c else x)
 

print(df)


model_ols = smf.ols("A ~ B+C",data=df).fit()


print(model_ols.summary())


df[['B','C']].plot()
plt.show()




# Two prediction scenarios: vary B while holding C fixed at 30 and at 100
df2 = pd.DataFrame()
df2['B'] = np.linspace(10, 50, 10)
df2['C'] = 30


df3 = pd.DataFrame()
df3['B'] = np.linspace(10, 50, 10)
df3['C'] = 100


predB=model_ols.predict(df2)
predC=model_ols.predict(df3)
plt.plot(df2['B'],predB,label='predict B C=30')
plt.plot(df3['B'],predC,label='predict B C=100')
plt.legend()
plt.show()


print("A change in the probability of C affects the probability of B")


intercept=model_ols.params.loc['Intercept']
B_slope=model_ols.params.loc['B']
C_slope=model_ols.params.loc['C']
#Intercept    11.874252
#B             0.760859
#C            -0.060257


print("Intercept {}\n B slope{}\n C    slope{}\n".format(intercept,B_slope,C_slope))




#lower_conf,upper_conf=np.exp(model_ols.conf_int())
#print(lower_conf,upper_conf)
#print((1-(lower_conf/upper_conf))*100)


model_cov=model_ols.cov_params()
std_errorB = np.sqrt(model_cov.loc['B', 'B'])
std_errorC = np.sqrt(model_cov.loc['C', 'C'])
print('Standard Error: ', round(std_errorB, 4), round(std_errorC, 4))
#check for statistically significant
print("B z value {} C z value {}".format((B_slope/std_errorB),(C_slope/std_errorC)))
print("B feature is more statistically significant than C")




Output:


A change in the probability of C affects the probability of B
Intercept 11.874251554067563
B slope0.7608594144571961
C slope-0.060256845997223814


Standard Error:  0.4519 0.0793
B z value 1.683510336937001 C z value -0.7601036314930376
B feature is more statistically significant than C


z>2 is statistically significant
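As a cross-check of the hand-computed z values (a sketch, assuming the model_ols fit from above), statsmodels exposes the same coef/std-err ratios and their p-values directly:

print(model_ols.tvalues)   # should match the manual B and C z values above
print(model_ols.pvalues)   # two-sided p-values for Intercept, B and C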