Logistic Regression: "Unknown label type: 'continuous'" using sklearn in Python

I have the following code to test some of the most popular machine learning algorithms in the sklearn Python library:

import numpy as np
from sklearn                        import metrics, svm
from sklearn.linear_model           import LinearRegression
from sklearn.linear_model           import LogisticRegression
from sklearn.tree                   import DecisionTreeClassifier
from sklearn.neighbors              import KNeighborsClassifier
from sklearn.discriminant_analysis  import LinearDiscriminantAnalysis
from sklearn.naive_bayes            import GaussianNB
from sklearn.svm                    import SVC


trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])


clf = LinearRegression()
clf.fit(trainingData, trainingScores)
print("LinearRegression")
print(clf.predict(predictionData))


clf = svm.SVR()
clf.fit(trainingData, trainingScores)
print("SVR")
print(clf.predict(predictionData))


clf = LogisticRegression()
clf.fit(trainingData, trainingScores)
print("LogisticRegression")
print(clf.predict(predictionData))


clf = DecisionTreeClassifier()
clf.fit(trainingData, trainingScores)
print("DecisionTreeClassifier")
print(clf.predict(predictionData))


clf = KNeighborsClassifier()
clf.fit(trainingData, trainingScores)
print("KNeighborsClassifier")
print(clf.predict(predictionData))


clf = LinearDiscriminantAnalysis()
clf.fit(trainingData, trainingScores)
print("LinearDiscriminantAnalysis")
print(clf.predict(predictionData))


clf = GaussianNB()
clf.fit(trainingData, trainingScores)
print("GaussianNB")
print(clf.predict(predictionData))


clf = SVC()
clf.fit(trainingData, trainingScores)
print("SVC")
print(clf.predict(predictionData))

The first two work fine, but I get the following error on the LogisticRegression call:

root@ubupc1:/home/ouhma# python stack.py
LinearRegression
[ 15.72023529   6.46666667]
SVR
[ 3.95570063  4.23426243]
Traceback (most recent call last):
  File "stack.py", line 28, in <module>
    clf.fit(trainingData, trainingScores)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/logistic.py", line 1174, in fit
    check_classification_targets(y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

The input data is the same as in the previous calls, so what is going on here?

And by the way, why do the first predictions of the LinearRegression() and SVR() algorithms differ so much (15.72 vs 3.95)?


You are passing floats to a classifier which expects categorical values as the target vector. If you convert them to int they will be accepted as input (although it is questionable whether that is the right way to do it).

It would be better to convert your training scores with scikit's LabelEncoder.

The same is true for your DecisionTree and KNeighbors classifiers.

from sklearn import preprocessing
from sklearn import utils


lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)
# array([1, 3, 2, 0], dtype=int64)


print(utils.multiclass.type_of_target(trainingScores))
# continuous


print(utils.multiclass.type_of_target(trainingScores.astype('int')))
# multiclass


print(utils.multiclass.type_of_target(encoded))
# multiclass
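Once the scores are encoded, LogisticRegression (and the other classifiers) will accept them. A minimal sketch, reusing the trainingData, trainingScores and predictionData arrays from the question:

import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Data from the question
trainingData   = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

# Encode the continuous scores into discrete class labels
lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)

clf = LogisticRegression()
clf.fit(trainingData, encoded)           # accepted now: the target is 'multiclass'
predicted = clf.predict(predictionData)  # predictions are encoded class labels

# Map the predicted labels back to the original score values
print(lab_enc.inverse_transform(predicted))

Keep in mind that this treats each distinct score as its own class, which only makes sense if the scores really are categories rather than measurements.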

I ran into the same problem when trying to feed floats to the classifiers. I wanted to keep the floats rather than integers for accuracy. Try using regressor algorithms instead. For example:

import numpy as np
from sklearn import linear_model
from sklearn import svm


classifiers = [
    svm.SVR(),
    linear_model.SGDRegressor(),
    linear_model.BayesianRidge(),
    linear_model.LassoLars(),
    linear_model.ARDRegression(),
    linear_model.PassiveAggressiveRegressor(),
    linear_model.TheilSenRegressor(),
    linear_model.LinearRegression()]


trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])


for item in classifiers:
    print(item)
    clf = item
    clf.fit(trainingData, trainingScores)
    print(clf.predict(predictionData), '\n')

LogisticRegression is not for regression but for classification.

The Y variable must be a categorical class

(for example 0 or 1)

and not a continuous variable;

that would be a regression problem.
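A minimal sketch to illustrate, assuming made-up 0/1 labels (not the question's scores) for the same feature matrix:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
y = np.array([0, 1, 1, 0])   # categorical targets: classification works

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2.5, 2.4, 2.7]]))  # predicts a class label, i.e. 0 or 1

# With continuous targets like [3.4, 7.5, 4.5, 1.6], fit() raises
# ValueError: Unknown label type: 'continuous'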