Pandas read_csv dtype: read all columns but few as string

I'm using pandas to read a bunch of CSVs. Passing a dict to the dtype parameter tells pandas which columns to read as strings instead of the default:

dtype_dic = {'service_id': str, 'end_date': str, ...}
feedArray = pd.read_csv(feedfile, dtype=dtype_dic)

In my scenario, all the columns except a few specific ones are to be read as strings. So instead of defining several columns as str in dtype_dic, I'd like to set just my chosen few as int or float. Is there a way to do that?

This is a loop over various CSVs with different columns, so a direct column conversion after having read the whole csv as string (dtype=str) wouldn't be easy, since I don't immediately know which columns each csv has. (I'd rather spend that effort defining all the columns in the dtype dict!)

Edit: But if there's a way to convert column names to numeric without throwing an error when a column isn't present in the csv, then yes, that would be a valid solution, if there's no other way of doing this at the csv reading stage itself.
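Such a tolerant conversion can be sketched with pandas.to_numeric, guarding each column with a membership check so absent columns are simply skipped (the column names here are hypothetical, for illustration only):

```python
import pandas as pd

# example frame standing in for one csv read with dtype=str
df = pd.DataFrame({'service_id': ['a1', 'a2'], 'end_date': ['2020', '2021']})

# columns we would like numeric; 'fare' does not exist in this csv
numeric_cols = ['end_date', 'fare']

for col in numeric_cols:
    if col in df.columns:  # skip columns absent from this particular csv
        df[col] = pd.to_numeric(df[col], errors='coerce')
```

With errors='coerce', unparsable values become NaN instead of raising, which also helps when a column's contents are messier than expected.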

Note: this sounds like a previously asked question, but the answers there went down a very different path (bool-related) which doesn't apply here. Please don't mark as duplicate!


EDIT - sorry, I misread your question. Updated my answer.

You can read the entire csv as strings then convert your desired columns to other types afterwards like this:

df = pd.read_csv('/path/to/file.csv', dtype=str)
# example df; yours will be from pd.read_csv() above
df = pd.DataFrame({'A': ['1', '3', '5'], 'B': ['2', '4', '6'], 'C': ['x', 'y', 'z']})
types_dict = {'A': int, 'B': float}
for col, col_type in types_dict.items():
    df[col] = df[col].astype(col_type)

Another approach, if you really want to specify the proper types for all columns when reading the file in and not change them afterwards: read in just the column names (no rows), then use those to fill in which columns should be strings.

col_names = pd.read_csv('file.csv', nrows=0).columns
types_dict = {'A': int, 'B': float}
types_dict.update({col: str for col in col_names if col not in types_dict})
pd.read_csv('file.csv', dtype=types_dict)

I recently encountered the same issue, though I only have one csv file so I don't need to loop over files. I think this solution can be adapted into a loop as well.

Here I present a solution I used. Pandas' read_csv has a parameter called converters which overrides dtype, so you may take advantage of this feature.

An example code is as follows: Assume that our data.csv file contains all float64 columns except A and B which are string columns. You may read this file using:

df = pd.read_csv('data.csv', dtype = 'float64', converters = {'A': str, 'B': str})

The code gives warnings that converters override dtypes for these two columns A and B, and the result is as desired.

Regarding looping over several csv files, all one needs to do is figure out which columns will be exceptions to put in converters. This is easy if the files have a similar pattern of column names; otherwise, it would get tedious.
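One way to adapt this into a loop is to peek at each file's header first and build the converters dict per file, so only the non-numeric columns get a str converter (the column set and helper name below are hypothetical):

```python
import io
import pandas as pd

# hypothetical: columns that should stay numeric; everything else reads as str
numeric_cols = {'fare'}

def read_feed(buf):
    cols = pd.read_csv(buf, nrows=0).columns  # header row only
    buf.seek(0)  # rewind so the full read starts from the top
    conv = {c: str for c in cols if c not in numeric_cols}
    return pd.read_csv(buf, converters=conv)

# io.StringIO stands in for a file path in this self-contained sketch
csv = io.StringIO('service_id,fare\n001,2.5\n002,3.0\n')
df = read_feed(csv)
# service_id stays a string (the leading zeros survive); fare is inferred as float
```

The seek(0) is only needed for the in-memory buffer; with a real file path you would just call pd.read_csv twice on the same path.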

You can do the following:

pd.read_csv(self._LOCAL_FILE_PATH,
            index_col=0,
            encoding="utf-8",
            dtype={
                'customer_id': 'int32',
                'product_id': 'int32',
                'subcategory_id': 'int16',
                'category_id': 'int16',
                'gender': 'int8',
                'views': 'int8',
                'purchased': 'int8',
                'added': 'int8',
                'time_on_page': 'float16',
            })

Extending @MECoskun's answer using converters while simultaneously stripping leading and trailing white space, making converters more versatile:

df = pd.read_csv('data.csv', dtype = 'float64', converters = {'A': str.strip, 'B': str.strip})

There are also lstrip and rstrip, which could be used instead of strip if needed. Note: pass str.strip, not str.strip() — the converter must be the function itself, not the result of calling it. And of course, don't strip non-string columns.
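A small self-contained sketch of the idea (using an in-memory buffer in place of data.csv): each converter receives the raw cell text from the parser, so str.strip removes the surrounding whitespace before the value is stored.

```python
import io
import pandas as pd

# stand-in for data.csv: A and B have stray whitespace, C is numeric
csv = io.StringIO('A,B,C\n  x ,  7 ,1.5\n y  , 8  ,2.5\n')

# converters override the blanket float64 dtype for A and B
df = pd.read_csv(csv, dtype='float64',
                 converters={'A': str.strip, 'B': str.strip})
```

As with the earlier example, pandas emits a warning that the converters override the dtype for A and B, but the result is as desired: A and B come back as trimmed strings, C as float64.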