Assign pandas dataframe column dtypes


Best answer

Since 0.17, you have to use the explicit conversions:

pd.to_datetime, pd.to_timedelta and pd.to_numeric

(As mentioned below, there is no more "magic": convert_objects was deprecated in 0.17.)

df = pd.DataFrame({'x': {0: 'a', 1: 'b'}, 'y': {0: '1', 1: '2'}, 'z': {0: '2018-05-01', 1: '2018-05-02'}})


df.dtypes


x    object
y    object
z    object
dtype: object


df


   x  y           z
0  a  1  2018-05-01
1  b  2  2018-05-02

You can apply these to each column you want to convert:

df["y"] = pd.to_numeric(df["y"])
df["z"] = pd.to_datetime(df["z"])
df


   x  y          z
0  a  1 2018-05-01
1  b  2 2018-05-02


df.dtypes


x            object
y             int64
z    datetime64[ns]
dtype: object

and confirm that the dtypes have been updated.
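The answer names pd.to_timedelta but does not show it, and real data often contains values that do not parse. A minimal sketch of both cases follows; the column names and sample values are made up for illustration, and errors="coerce" is one possible policy, not something the answer prescribes:

```python
import pandas as pd

# Illustrative data: "oops" cannot be parsed as a number
df = pd.DataFrame({
    "y": ["1", "2", "oops"],
    "d": ["1 days", "2 days", "3 days"],
})

# errors="coerce" replaces unparseable values with NaN instead of raising,
# so the resulting column is float64 rather than int64
df["y"] = pd.to_numeric(df["y"], errors="coerce")

# Strings such as "1 days" are converted to a timedelta dtype
df["d"] = pd.to_timedelta(df["d"])

print(df.dtypes)
```

Without errors="coerce", pd.to_numeric raises on the first value it cannot parse, which is often the safer default when bad values are unexpected.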

OLD/DEPRECATED ANSWER for pandas 0.12 - 0.16: you can use convert_objects to infer better dtypes:

In [21]: df
Out[21]:
   x  y
0  a  1
1  b  2


In [22]: df.dtypes
Out[22]:
x    object
y    object
dtype: object


In [23]: df.convert_objects(convert_numeric=True)
Out[23]:
   x  y
0  a  1
1  b  2


In [24]: df.convert_objects(convert_numeric=True).dtypes
Out[24]:
x    object
y     int64
dtype: object

Magic! (Sad to see it deprecated.)
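On current pandas, something close to the old behaviour can be approximated by trying pd.to_numeric on each column and keeping the original column when the conversion fails. This is only a sketch: convert_numeric_columns is a hypothetical helper name, not a pandas function, and it does not reproduce everything convert_objects did.

```python
import pandas as pd


def convert_numeric_columns(df):
    """Return a copy of df in which every fully numeric column is converted."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Column contains non-numeric values; leave it unchanged
            pass
    return out


df = pd.DataFrame({"x": ["a", "b"], "y": ["1", "2"]})
converted = convert_numeric_columns(df)
print(converted.dtypes)
```

Unlike errors="coerce", this keeps a column intact if any value in it fails to parse, so no data is silently turned into NaN.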


For those coming from Google (etc.), like me:

convert_objects has been deprecated since 0.17. If you use it, you get a warning like this:

FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

You should do something like the following:

df = df.astype(float)

df["A"] = pd.to_numeric(df["A"])
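When several columns need the same conversion, pd.to_numeric can be applied column by column with DataFrame.apply. A minimal sketch; the frame and the column names A, B and C are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": ["1", "2"], "B": ["3.5", "4"], "C": ["x", "y"]})

# Convert only the columns that are expected to be numeric
numeric_cols = ["A", "B"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

Here pd.to_numeric picks the narrowest fitting numeric dtype per column, so A becomes an integer column and B a float column, while C is left untouched.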

Another way to set the column types is to first construct a numpy record array with the desired types, fill it in, and then pass it to the DataFrame constructor.

import pandas as pd
import numpy as np

x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
df = pd.DataFrame(x)

df.dtypes ->
x      uint8
y    float64
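The "fill it in" step is not shown above. A small sketch of it, assuming the fields are filled by name before the array is handed to the DataFrame constructor (the values are illustrative, and np.zeros is used instead of np.empty so that no uninitialised memory is involved):

```python
import numpy as np
import pandas as pd

# Structured array with one uint8 field and one float64 field
rec = np.zeros(3, dtype=[("x", np.uint8), ("y", np.float64)])

# Fill each field by name
rec["x"] = [1, 2, 3]
rec["y"] = [0.5, 1.5, 2.5]

# The DataFrame picks up the field names and dtypes from the array
df = pd.DataFrame(rec)
print(df.dtypes)
```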

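I faced a similar problem to yours. In my case I have thousands of files from Cisco logs that I need to parse manually.

To stay flexible with fields and types, I have successfully tested StringIO + read_csv, which does accept a dict for the dtype specification.

I usually read each file (5k-20k lines) into a buffer and create the dtype dictionary dynamically.

Finally, I concatenate these dataframes (with categoricals... thanks to 0.19) into one large dataframe, which I dump into hdf5.

Something along these lines: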

import pandas as pd
import io

output = io.StringIO()
output.write('A,1,20,31\n')
output.write('B,2,21,32\n')
output.write('C,3,22,33\n')
output.write('D,4,23,34\n')

output.seek(0)

df = pd.read_csv(output, header=None,
                 names=["A", "B", "C", "D"],
                 dtype={"A": "category", "B": "float32", "C": "int32", "D": "float64"},
                 sep=",")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null category
B    5 non-null float32
C    5 non-null int32
D    5 non-null float64
dtypes: category(1), float32(1), float64(1), int32(1)
memory usage: 205.0 bytes
None

Not very pythonic, but it does the job.

Hope it helps.
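The dynamic dtype dictionary and the concatenation step could look roughly like the sketch below. The rule used to build the dictionary (first column categorical, the rest float32) and the two in-memory buffers are assumptions made for illustration; they stand in for whatever the real log files require.

```python
import io

import pandas as pd
from pandas.api.types import union_categoricals

names = ["A", "B", "C", "D"]

# Hypothetical rule: first column is categorical, all others are float32
dtypes = {name: ("category" if i == 0 else "float32")
          for i, name in enumerate(names)}

# Two small buffers standing in for two parsed log files
buffers = [
    io.StringIO("A,1,20,31\nB,2,21,32\n"),
    io.StringIO("C,3,22,33\nD,4,23,34\n"),
]
frames = [pd.read_csv(buf, header=None, names=names, dtype=dtypes)
          for buf in buffers]

# Give every frame the same category set so that concat keeps the
# category dtype instead of falling back to a non-categorical column
all_cats = union_categoricals([f["A"] for f in frames]).categories
frames = [f.assign(A=f["A"].cat.set_categories(all_cats)) for f in frames]

big = pd.concat(frames, ignore_index=True)
print(big.dtypes)
```

Aligning the categories first matters because concatenating categorical columns whose category sets differ does not preserve the category dtype.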

JC

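You can set the types explicitly with pandas DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs), passing a dictionary that maps each column to the dtype you want as the dtype argument (newer pandas versions replace raise_on_error with errors='raise').

Here's an example: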

import pandas as pd

wheel_number = 5
car_name = 'jeep'
minutes_spent = 4.5

# set the columns
data_columns = ['wheel_number', 'car_name', 'minutes_spent']

# create an empty dataframe
data_df = pd.DataFrame(columns=data_columns)
df_temp = pd.DataFrame([[wheel_number, car_name, minutes_spent]], columns=data_columns)
# DataFrame.append was removed in pandas 2.0; on newer versions use
# pd.concat([data_df, df_temp], ignore_index=True) instead
data_df = data_df.append(df_temp, ignore_index=True)

You get

In [11]: data_df.dtypes
Out[11]:
wheel_number     float64
car_name          object
minutes_spent    float64
dtype: object

Then, with

data_df = data_df.astype(dtype= {"wheel_number":"int64", "car_name":"object","minutes_spent":"float64"})

you can see that it has changed:

In [18]: data_df.dtypes
Out[18]:
wheel_number       int64
car_name          object
minutes_spent    float64
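The dictionary passed to astype does not have to list every column; columns that are left out keep their current dtype. A short sketch with made-up rows, reusing the column names from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "wheel_number": [5.0, 4.0],
    "car_name": ["jeep", "mini"],
    "minutes_spent": ["4.5", "3"],
})

# Only the listed columns are converted; car_name is not touched
df = df.astype({"wheel_number": "int64", "minutes_spent": "float64"})

print(df.dtypes)
```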

You are better off using typed np.arrays and then passing the data and column names as a dictionary.

import numpy as np
import pandas as pd

# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([1, 2], dtype=np.int32)
df = pd.DataFrame({
    'x': x,  # Feature: column name is near data array
    'y': y,
})