Passing categorical data to Sklearn decision trees

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation tells us the following:

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

But running the following script

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']


tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data; is it possible with Sklearn?


Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:

def cat2int(column):
    vals = list(set(column))
    for i, string in enumerate(column):
        column[i] = vals.index(string)
    return column

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.

Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.

Refer to the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])

This automatically encodes the categories as numbers for your machine learning algorithms. It also supports going back from integers to strings: simply call inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This would return ['tokyo', 'tokyo', 'paris'].
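Since the main selling point above is Pipeline integration, here is a minimal sketch of wiring an encoder and a tree together, using the toy DataFrame from the question. The ColumnTransformer setup and column names are my additions, not part of the original answer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'C': [0, 0, 1, 0],
                     'Class': ['n', 'n', 'y', 'n']})

# One-hot encode the string columns, pass the numeric column through untouched
pre = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['A', 'B'])],
    remainder='passthrough')

clf = Pipeline([('pre', pre),
                ('tree', DecisionTreeClassifier(random_state=0))])

# The pipeline encodes and fits in one step, so no manual preprocessing leaks
# between train and test data
clf.fit(data[['A', 'B', 'C']], data['Class'])
print(clf.predict(data[['A', 'B', 'C']]))
```

On this tiny, perfectly separable dataset the tree reproduces the training labels exactly.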

Also note that for many classifiers other than decision trees, such as logistic regression or SVM, you will typically want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.

Hope this helps!

(..)

Able to handle both numerical and categorical data.

This only means that you can use

  • the DecisionTreeClassifier class for classification problems
  • the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']


tree = DecisionTreeClassifier()


one_hot_data = pd.get_dummies(data[['A', 'B', 'C']], drop_first=True)
tree.fit(one_hot_data, data['Class'])

(This is just a reformat of my comment above from 2016...it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.

Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but is computationally expensive.
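To see why label encoding produces nonsensical splits for nominal data, note that LabelEncoder assigns codes in alphabetical order of the class labels, not in any domain order. A small demonstration (my own example, not from the original answer):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['low', 'medium', 'high'])

# classes_ is sorted alphabetically, so 'high' gets code 0, 'low' code 1,
# 'medium' code 2 - the integer order has nothing to do with the real order.
print(list(le.classes_))  # alphabetical order of the labels
print(list(codes))
```

A tree split such as "code <= 1" would then group 'high' with 'low' against 'medium', which is exactly the kind of meaningless split the answer warns about.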

For nominal categorical variables, I would not use LabelEncoder, but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in this type of variable.

As of v0.24.0, scikit supports the use of categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor natively!

To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:

>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])

Equivalently, one can pass a list of integers indicating the indices of the categorical features:

>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])

You still need to encode your strings first, otherwise you will get a "could not convert string to float" error. See here for an example of using OrdinalEncoder to convert strings to integers.

Yes, in theory a decision tree is able to handle both numerical and categorical data. In practice, though, you should apply either OrdinalEncoder or one-hot encoding to the categorical features before training or testing the model. Always remember that ML models don't understand anything other than numbers.

You can apply a conversion method such as one-hot encoding to transform your categorical data into numeric features and then create the tree.

Refer to this URL for more information: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
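For reference, here is a minimal standalone sketch of the OneHotEncoder from that page, applied to the string columns from the question (the toy values are my own):

```python
from sklearn.preprocessing import OneHotEncoder

# Columns A and B from the question, as a plain list of rows
X = [['a', 'b'], ['a', 'b'], ['b', 'a'], ['a', 'b']]

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; .toarray() densifies it
encoded = enc.fit_transform(X).toarray()
print(encoded)
```

Each input column with two categories expands to two indicator columns, so the output has four columns per row.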