Python Pandas - Changing some column types to categories

I read the following CSV file into an IPython Notebook:

public = pd.read_csv("categories.csv")
public

I have also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The data types are as follows (this is a summary; there are about 100 columns):

In [36]:   public.dtypes
Out[36]:   parks          object
playgrounds    object
sports         object
roading        object
resident       int64
children       int64

I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (they have Likert-scale responses, and each column has a different set of similar responses, e.g. one has 'strongly agree', 'agree', etc., another has 'very important', 'important', etc.), leaving the rest as int64.

I was able to create a separate DataFrame, public1, and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change several at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

That said, I don't want to create a separate DataFrame containing only the category columns. I would like them changed in the original DataFrame.

I tried many ways to achieve this, then tried the code here: Pandas: change data type of columns...

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert-scale responses can be analyzed) without touching 'resident' and 'children' (and the other 94 string, int and float columns)? Or is there a better way of doing this? If anyone has any suggestions and/or feedback I would be most grateful... I am slowly going bald tearing my hair out!

Many thanks in advance.

Edited to add: I am using Python 2.7.


Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
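
Since the question mentions Likert-scale responses, it can also help to make each column an ordered categorical, so comparisons, sorting and min/max respect the scale. A minimal sketch, assuming hypothetical response labels for 'parks' (adjust the labels per column; CategoricalDtype requires pandas >= 0.21):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical Likert scale for the 'parks' column; adjust the labels
# to match each column's actual responses.
likert = CategoricalDtype(
    categories=['disagree', 'agree', 'strongly agree'], ordered=True
)

public = pd.DataFrame({'parks': ['agree', 'strongly agree', 'disagree']})
public['parks'] = public['parks'].astype(likert)

# Ordered categories support comparisons and scale-aware min/max:
print(public['parks'].min())                  # disagree
print((public['parks'] > 'disagree').sum())   # 2
```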

As of pandas 0.19.0, the What's New entry describes that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

import pandas as pd
import numpy as np


# Generate random data, four category-like columns, two int columns
N = 10000
categories = pd.DataFrame({
    'parks': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'playgrounds': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'sports': np.random.choice(['important', 'very important', 'not important'], size=N),
    'roading': np.random.choice(['important', 'very important', 'not important'], size=N),
    'resident': np.random.choice([1, 2, 3], size=N),
    'children': np.random.choice([0, 1, 2, 3], size=N)
})
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes


children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>=0.19.0

For mixed dtypes parsing as Categorical can be implemented by passing a dictionary dtype={'colname' : 'category', ...} in read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object

Performance

A slight speed-up (local Jupyter notebook), as mentioned in the release notes.

# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop


# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
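
Beyond parse speed, the bigger win from the category dtype is usually memory: a low-cardinality object column stores every Python string, while a categorical stores small integer codes plus one copy of each label. An illustrative check (exact byte counts vary by pandas version):

```python
import numpy as np
import pandas as pd

# Low-cardinality string column, similar to the survey responses above.
s_obj = pd.Series(np.random.choice(['agree', 'disagree'], size=10000))
s_cat = s_obj.astype('category')

print(s_obj.memory_usage(deep=True))  # object dtype: every string stored
print(s_cat.memory_usage(deep=True))  # category dtype: compact integer codes
```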

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to execute this inplace, so typically I'll end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2', ...]:
    dataframe_name[col] = dataframe_name[col].astype(float)

Jupyter Notebook

In my case, I had a big DataFrame with many object columns that I wanted to convert to category.

So what I did was select the object columns, fill anything that was NA with 'Missing', and then save the result back into the original DataFrame, as in:

# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()

I hope this might be a helpful resource for later reference
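
As a side note on the fillna step above: astype('category') keeps NaN as a missing value rather than a category level, so filling first is what makes 'Missing' an explicit level. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['agree', np.nan, 'disagree'])

# Without fillna, NaN stays missing and is not listed as a category:
print(s.astype('category').cat.categories.tolist())
# ['agree', 'disagree']

# Filling first makes 'Missing' an explicit category level:
print(s.fillna('Missing').astype('category').cat.categories.tolist())
# ['Missing', 'agree', 'disagree']
```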

No need for loops; pandas can do it directly now. Just pass a list of the columns you want to convert and pandas will convert them all.

cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')

df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})


>>     a  b
>>  0  a  c
>>  1  b  d
>>  2  c  e


df.dtypes
>> a    object
>> b    object
>> dtype: object


df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a    category
>> b    category
>> dtype: object

To make things easier. No apply. No map. No loop.

cols = data.select_dtypes(exclude='int').columns.to_list()
data[cols] = data[cols].astype('category')

Using a list comprehension (avoiding a loop), this converts all columns with dtype object to category. I've used 'df' as the DataFrame name to be more generic.

df[[col for col in df.columns if df[col].dtypes == object]].astype('category', copy=False)

In case you'd like to avoid the copy=False argument for some reason (as the documentation tells us to be careful when using it), you may use the following line.

df[[col for col in df.columns if df[col].dtypes == object]] = df[[col for col in df.columns if df[col].dtypes == object]].astype('category')

This is my first answer on stack, so please be kind.