将列类型转换为字符串或分类

如何将熊猫数据框中的一列转换为字符串类型?在下面的存储数据 df 中,我需要将 zipcode 转换为 string,这样当我运行线性回归时,zipcode 就被视为绝对的,而不是数字的。谢谢!

df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5}, 'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603}, 'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4}, 'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180}, 'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})
print (df)
bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109
233339 次浏览

To convert a column into a string type (that will be an object column per se in pandas), use astype:

df.zipcode = zipcode.astype(str)

If you want to get a Categorical column, you can pass the parameter 'category' to the function:

df.zipcode = zipcode.astype('category')

You need astype:

df['zipcode'] = df.zipcode.astype(str)
#df.zipcode = df.zipcode.astype(str)

For converting to categorical:

df['zipcode'] = df.zipcode.astype('category')
#df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

df['zipcode'] = pd.Categorical(df.zipcode)

Sample with data:

import pandas as pd


df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155}, 'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5}, 'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603}, 'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4}, 'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180}, 'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0}})
print (df)
bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109


print (df.dtypes)
bathrooms      float64
bedrooms         int64
floors         float64
sqft_living      int64
sqft_lot         int64
zipcode          int64
dtype: object


df['zipcode'] = df.zipcode.astype('category')


print (df)
bathrooms  bedrooms  floors  sqft_living  sqft_lot zipcode
722         3.25         4     2.0         4670     51836   98005
2680        0.75         2     1.0         1440      3700   98107
14554       2.50         4     2.0         3180      9603   98155
17384       1.50         2     3.0         1430      1650   98125
18754       1.00         2     1.0         1130      2640   98109


print (df.dtypes)
bathrooms       float64
bedrooms          int64
floors          float64
sqft_living       int64
sqft_lot          int64
zipcode        category
dtype: object

With pandas >= 1.0 there is now a dedicated string datatype:

1) You can convert your column to this pandas string datatype using .astype('string'):

df['zipcode'] = df['zipcode'].astype('string')

2) This is different from using str which sets the pandas object datatype:

df['zipcode'] = df['zipcode'].astype(str)

3) For changing into categorical datatype use:

df['zipcode'] = df['zipcode'].astype('category')

You can see this difference in datatypes when you look at the info of the dataframe:

df = pd.DataFrame({
'zipcode_str': [90210, 90211] ,
'zipcode_string': [90210, 90211],
'zipcode_category': [90210, 90211],
})


df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_str'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')


df.info()


# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category
#   Column            Non-Null Count  Dtype
---  ------            --------------  -----
0   zipcode_str       2 non-null      object
1   zipcode_string    2 non-null      string
2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)

From the docs:

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

  3. When reading code, the contents of an object dtype array is less clear than string.

More info on working with the new string datatype can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

Prior answers focused on nominal data (e.g. unordered). If there is a reason to impose order for an ordinal variable, then one would use:

# Transform to category
df['zipcode_category'] = df['zipcode_category'].astype('category')


# Add ordered category
df['zipcode_ordered'] = df['zipcode_category']


# Setup the ordering
df.zipcode_ordered.cat.set_categories(
new_categories = [90211, 90210], ordered = True, inplace = True
)


# Output IDs
df['zipcode_ordered_id'] = df.zipcode_ordered.cat.codes
print(df)
#  zipcode_category zipcode_ordered  zipcode_ordered_id
#            90210           90210                   1
#            90211           90211                   0

More details on setting ordered categories can be found at the pandas website:

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#sorting-and-order