Creating an empty DataFrame in pandas with specified column types

I'm trying to create an empty DataFrame with an index and specify the column types. This is how I'm doing it:

df = pd.DataFrame(index=['pbp'],
                  columns=['contract',
                           'state_and_county_code',
                           'state',
                           'county',
                           'starting_membership',
                           'starting_raw_raf',
                           'enrollment_trend',
                           'projected_membership',
                           'projected_raf'],
                  dtype=['str', 'str', 'str', 'str',
                         'int', 'float', 'float',
                         'int', 'float'])

However, I get the following error:

TypeError: data type not understood

What does this mean?


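With the default NumPy-backed integer dtypes, pandas integer columns cannot hold NaN, so a column made only of missing values cannot stay a pure integer column. You can either use a float column and convert it to integer when needed, or treat it as object. What you are trying to do is not how pandas is meant to be used. But if you really want it, you can get around the TypeError like this.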

df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)


     str1 str2 str2 int1 int2  flt1  flt2
pbp   NaN  NaN  NaN  NaN  NaN   NaN   NaN

You can rearrange the column order as you like. But again, this is not how pandas is supposed to be used.

 df.dtypes
str1     object
str2     object
str2     object
int1     object
int2     object
flt1    float64
flt2    float64
dtype: object

Note that int is treated as object.
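If the goal is an integer column that can still show a missing value in the 'pbp' row, the nullable extension dtype "Int64" (capital I, available since pandas 0.24) may be an option. This is a minimal sketch, assuming a reasonably recent pandas; the two column names are borrowed from the question and everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Nullable integer array: it can store a missing value without
# falling back to float or object.
membership = pd.array([None], dtype="Int64")

df = pd.DataFrame(
    {
        "starting_membership": membership,       # nullable integer column
        "starting_raw_raf": np.array([np.nan]),  # ordinary float64 column
    },
    index=["pbp"],
)

print(df.dtypes)
```

Unlike the concat workaround, the integer column here keeps an integer dtype even though its only value is missing.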

This does not work; it is just a remark.

You can get around the Type Error using np.dtype:

pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))

But then you get this instead:

NotImplementedError: compound dtypes are not implemented in the DataFrame constructor

I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

import numpy as np
import pandas as pd


def make_empty_typed_df(dtype):
    # np.typeDict was removed in NumPy 1.24; np.sctypeDict is the same mapping
    tdict = np.sctypeDict
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    # build one row of default values so from_records sees the dtypes, then slice it away
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]

Testing it out...

from itertools import chain


# np.sctypeDict replaces np.typeDict, which was removed in NumPy 1.24
dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.sctypeDict, set(np.sctypeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.sctypeDict.get(t, t) != np.void) and not isinstance(t, int)]


print(make_empty_typed_df(dtype))

Out:

Empty DataFrame


Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []


[0 rows x 146 columns]

And the dtypes:

print(make_empty_typed_df(dtype).dtypes)

Out:

col0      timedelta64[ns]
col6               uint16
col16              uint64
col23                int8
col24     timedelta64[ns]
col25                bool
col26           complex64
col27               int64
col29             float64
col30                int8
col31             float16
col32              uint64
col33               uint8
col34              object
col35          complex128
col36               int64
col37               int16
col38               int32
col39               int32
col40             float16
col41              object
col42              uint64
col43              object
col44               int16
col45              object
col46               int64
col47               int16
col48              uint32
col49              object
col50              uint64
...
col144              int32
col145               bool
col146            float64
col147     datetime64[ns]
col148             object
col149             object
col150         complex128
col151    timedelta64[ns]
col152              int32
col153              uint8
col154            float64
col156              int64
col157             uint32
col158             object
col159               int8
col160              int32
col161             uint64
col162              int16
col163             uint32
col164             object
col165     datetime64[ns]
col166            float32
col167               bool
col168            float64
col169         complex128
col170            float16
col171             object
col172             uint16
col173          complex64
col174         complex128
dtype: object
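For a quicker check than the full type table, the same function can be tried on a two-column dtype list. The sketch below repeats the function so it runs on its own, uses np.sctypeDict (np.typeDict is gone from NumPy 1.24 onward), and the column names "n" and "x" are made up for illustration.

```python
import numpy as np
import pandas as pd


def make_empty_typed_df(dtype):
    # np.sctypeDict maps type names and codes to NumPy scalar types.
    tdict = np.sctypeDict
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    # Build one dummy row of default values, then slice it away.
    one_row = np.array([tuple(t() for t in types)], dtype=dtype)
    return pd.DataFrame.from_records(one_row).iloc[:0, :]


# Hypothetical two-column schema, used only for this check.
small = make_empty_typed_df([("n", "int64"), ("x", "float64")])
print(small.dtypes)
```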

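Adding an index gets tricky because most data types have no true missing value, so they end up being cast to some other type that has a native missing value (for example, ints are cast to float or object). However, if you have complete data of the types you specified, you can always insert rows as needed, and your types will be respected. This can be accomplished with: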

df.loc[index, :] = new_row

Again, as @Hun pointed out, this is not how pandas is meant to be used.
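The cast described above is easy to observe on a single Series. This is a small sketch, not part of the original answer: reindexing an empty int64 Series to a label it has no data for forces an upcast, while the nullable "Int64" extension dtype (assuming pandas 0.24 or later) keeps its type.

```python
import pandas as pd

empty_int = pd.Series(dtype="int64")     # typed, zero rows
with_row = empty_int.reindex(["pbp"])    # one row, but no data for it

print(empty_int.dtype)   # int64
print(with_row.dtype)    # float64: NaN cannot live in an int64 column

# The nullable extension dtype has its own missing value, so no upcast.
nullable = pd.Series(dtype="Int64").reindex(["pbp"])
print(nullable.dtype)    # Int64
```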

You can do this by passing a dictionary into the DataFrame constructor:

df = pd.DataFrame(index=['pbp'],
                  data={'contract': np.full(1, "", dtype=str),
                        # int cannot represent NaN, so the cast yields a huge negative number
                        'starting_membership': np.full(1, np.nan, dtype=int),
                        'projected_membership': np.full(1, np.nan, dtype=float)
                        }
                  )

This will correctly give you a dataframe that looks like:

    contract  projected_membership   starting_membership
pbp       ""                   NaN  -9223372036854775808

With dtypes:

contract                 object
projected_membership    float64
starting_membership       int64

That said, there are two things to note:

1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.

2) Why don't you see NaN under starting_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so it casts np.nan to an integer. If you want a different default value, you can change that in the np.full call.

This really smells like a bug.
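Point 2 above mentions changing the default value in the np.full call. A minimal sketch of that idea follows; the sentinel -1 and the name MISSING_COUNT are hypothetical choices for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

MISSING_COUNT = -1  # hypothetical sentinel meaning "no value yet"

df = pd.DataFrame(
    index=["pbp"],
    data={
        # An explicit integer fill value avoids casting NaN to int.
        "starting_membership": np.full(1, MISSING_COUNT, dtype=np.int64),
        "projected_membership": np.full(1, np.nan, dtype=float),
    },
)

print(df)
print(df.dtypes)
```

The integer column stays int64, at the cost of having to remember what the sentinel means.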

Here is another (simpler) solution.

import pandas as pd
import numpy as np


def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df


df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64

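This is an old question, but I don't see a solid answer (although @eric_g came very close).

You just need to create an empty DataFrame from a dictionary of key:value pairs. The key is the column name, and the value is an empty instance of the data type.

So for your example dataset, it would look as follows (pandas 0.25 and Python 3.7):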

variables = {'contract': '',
             'state_and_county_code': '',
             'state': '',
             'county': '',
             'starting_membership': int(),
             'starting_raw_raf': float(),
             'enrollment_trend': float(),
             'projected_membership': int(),
             'projected_raf': float()}


df = pd.DataFrame(variables, index=[])

In earlier pandas versions, one may have had to do:

df = pd.DataFrame(columns=[variables])
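To confirm what the dictionary-of-empty-values approach produces, the dtypes can be checked directly. The sketch below uses two of the question's numeric columns and the helper functions from pandas.api.types; the string columns are left out because the dtype label they receive can differ between pandas versions.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

variables = {
    "starting_membership": int(),   # 0   -> integer column
    "starting_raw_raf": float(),    # 0.0 -> float column
}

# index=[] gives zero rows while the scalar values still set the dtypes.
df = pd.DataFrame(variables, index=[])

print(len(df))
print(df.dtypes)
```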

One way to do it:

import numpy
import pandas


dtypes = numpy.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", numpy.datetime64),
    ]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
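A variation on the structured-array approach spells out the bit widths and the datetime unit so the resulting column dtypes are unambiguous. The explicit "int64", "float64" and "datetime64[ns]" strings are an assumption of this sketch rather than part of the answer above.

```python
import numpy as np
import pandas as pd

dtypes = np.dtype(
    [
        ("b", "int64"),
        ("c", "float64"),
        ("d", "datetime64[ns]"),  # explicit unit instead of bare datetime64
    ]
)

# A zero-length structured array carries the field dtypes into the frame.
df = pd.DataFrame(np.empty(0, dtype=dtypes))
print(df.dtypes)
```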

The simplest workaround I found was to concatenate a list of empty Series, one per column:

import pandas as pd


columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract                 0 non-null object
# state_and_county_code    0 non-null object
# state                    0 non-null object
# county                   0 non-null object
# starting_membership      0 non-null int32
# starting_raw_raf         0 non-null float64
# enrollment_trend         0 non-null float64
# projected_membership     0 non-null int32
# projected_raf            0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes

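My solution (without setting an index) is to initialize the DataFrame with the column names and then specify the data types with the astype() method.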

df = pd.DataFrame(columns=['contract',
                           'state_and_county_code',
                           'state',
                           'county',
                           'starting_membership',
                           'starting_raw_raf',
                           'enrollment_trend',
                           'projected_membership',
                           'projected_raf'])
df = df.astype(dtype={'contract': str,
                      'state_and_county_code': str,
                      'state': str,
                      'county': str,
                      'starting_membership': int,
                      'starting_raw_raf': float,
                      'enrollment_trend': float,
                      'projected_membership': int,
                      'projected_raf': float})

You can do it like this:

df = pd.DataFrame({'a': pd.Series(dtype='int'),
                   'b': pd.Series(dtype='str'),
                   'c': pd.Series(dtype='float')})

or more abstractly:

df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})

then if you call df you have:

>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []

and if you check its types:

>>> df.dtypes
a      int32
b     object
c    float64
dtype: object

Creating an empty DataFrame in pandas with specified column types:

import pandas as pd


c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')


df = pd.concat([c1, c2, c3, c4], axis=1)


df.info('verbose')

We create the columns as Series and give them the correct dtype, then concatenate the Series into a DataFrame, and that's it.

We effectively have a DataFrame constructor with dtypes!

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   c1      0 non-null      string
1   c2      0 non-null      bool
2   c3      0 non-null      float64
3   c4      0 non-null      int32
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes

Taking the columns and dtype lists from your example, you can do the following:

cdt={i[0]: i[1] for i in zip(columns, dtype)}    # make column type dict
pdf=pd.DataFrame(columns=list(cdt))    # create empty dataframe
pdf=pdf.astype(cdt)                    # set desired column types
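The three lines above can be wrapped in a small helper. This is only a sketch: the name empty_frame and its schema argument are hypothetical, and the example uses two numeric columns from the question.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def empty_frame(schema):
    """Return a zero-row DataFrame whose columns follow schema (name -> dtype)."""
    # Create untyped empty columns first, then cast them column by column.
    return pd.DataFrame(columns=list(schema)).astype(schema)


pdf = empty_frame({"starting_membership": "int", "starting_raw_raf": "float"})
print(pdf.dtypes)
```

Keeping the schema in one dict means the column names and types cannot drift apart the way two parallel lists can.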

The DataFrame documentation says that only a single dtype is allowed in the constructor call.

I suggest this:

columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10


df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t)
                   for c,t in zip(columns, types)})

Benefits

  • Supports older pandas versions (e.g. 0.19.2)
  • Can initialize both the types and the size at once

Fast(est) and clean: initialize directly with numpy ndarrays

import numpy as np
import pandas as pd


df = pd.DataFrame(
    {'a': np.ndarray((0,), dtype=int),
     'b': np.ndarray((0,), dtype=str),
     'c': np.ndarray((0,), dtype=float)
     }
)
print(df.dtypes)

yields

a      int64
b     object
c    float64
dtype: object

performance benchmark

This is also the fastest of the approaches timed below:

Python 3.8.8 (default, Apr 13 2021, 19:58:26)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.


In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
   ...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})
183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: def df_empty(columns, dtypes, index=None):
   ...:     assert len(columns)==len(dtypes)
   ...:     df = pd.DataFrame(index=index)
   ...:     for c,d in zip(columns, dtypes):
   ...:         df[c] = pd.Series(dtype=d)
   ...:     return df
   ...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])
1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
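The timings above come from one machine and one pandas version, so they are worth re-measuring. Below is a rough sketch for doing that outside IPython with the standard-library timeit module; the function names from_ndarrays and from_series and the repeat count are arbitrary choices, and only the numeric columns are included.

```python
import timeit

import numpy as np
import pandas as pd


def from_ndarrays():
    # Zero-length NumPy arrays carry the dtypes into the frame.
    return pd.DataFrame({
        "a": np.ndarray((0,), dtype=int),
        "c": np.ndarray((0,), dtype=float),
    })


def from_series():
    # Empty typed Series, one per column.
    return pd.DataFrame({
        "a": pd.Series(dtype="int"),
        "c": pd.Series(dtype="float"),
    })


# Both constructions should agree on the resulting dtypes.
same_dtypes = list(from_ndarrays().dtypes) == list(from_series().dtypes)

n = 200  # arbitrary repeat count
t_ndarray = timeit.timeit(from_ndarrays, number=n) / n
t_series = timeit.timeit(from_series, number=n) / n
print(f"ndarrays: {t_ndarray * 1e6:.0f} us, Series: {t_series * 1e6:.0f} us per call")
```

Absolute numbers will differ from those quoted above; only the relative ordering on your own setup is meaningful.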