How to make the separator in pandas read_csv more flexible with respect to whitespace, for irregular separators?

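I need to create a DataFrame by reading data from a file with the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, 2 tabs and then 1 space).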

Is there a way to tell pandas to handle these files properly?

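By the way, I do not have this problem if I use plain Python. I use:

for line in open(file_name):
    fld = line.split()

And it works perfectly. It does not care whether there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?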


From the documentation, you can use either a regex or delim_whitespace (note that delim_whitespace is deprecated as of pandas 2.2 in favor of sep=r"\s+"):

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
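For anyone who wants to try this without creating a file, here is a minimal sketch that feeds the same kind of data through io.StringIO. The sample string is an assumption that mirrors the file shown above; it is not taken from a real data set.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the file above: tabs and spaces mixed freely.
raw = "a\t  b\tc 1 2\nd\t  e\tf 3 4\n"

# sep=r"\s+" treats any run of whitespace (spaces, tabs, or both)
# as a single delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

The raw-string prefix keeps Python from interpreting \s as a string escape before the pattern reaches pandas.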
>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")

would use any combination of any number of spaces and tabs as the separator. Because \s already matches tabs, this pattern behaves the same as the shorter r"\s+".
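A quick way to compare the long alternation with plain \s+ is to split a line using the standard re module. This is only a sketch: the sample line and variable names are made up for illustration, and it checks the regular expressions themselves rather than pandas' parser.

```python
import re

# Hypothetical line mixing tabs and spaces in several ways.
line = "a\t  b\tc 1   \t 2"

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# The first alternative, \s+, greedily consumes each whole whitespace run,
# so the remaining alternatives never get a chance to match.
fields_long = re.split(long_pattern, line)
fields_short = re.split(short_pattern, line)

print(fields_long)
print(fields_short)
```

Both calls should yield the same list of fields, which is why the shorter pattern is usually enough.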

To cover every combination, including zero or more spaces or tabs on either side of a comma, consider this:

pd.read_csv("whitespace.csv", header=None, sep=r"[ \t]*,[ \t]*")
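As a rough illustration of that pattern, the sketch below parses a small comma-separated sample with stray spaces and tabs around the commas. The sample data is invented, and engine="python" is passed explicitly on the assumption that regex separators other than \s+ are handled by the Python parsing engine (otherwise pandas falls back to it with a warning).

```python
import io

import pandas as pd

# Hypothetical sample: commas padded with an irregular mix of spaces and tabs.
raw = "x , y,\tz\n1 ,2 , 3\n4,5\t,\t6\n"

# The separator is a comma plus any surrounding spaces or tabs.
df = pd.read_csv(io.StringIO(raw), sep=r"[ \t]*,[ \t]*", engine="python")
print(df)
```

Note that this pattern only strips whitespace adjacent to a comma; whitespace at the very start or end of a line is left in the first or last field.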

pandas has two CSV readers; only one is flexible regarding redundant leading whitespace:

pd.read_csv("whitespace.csv", skipinitialspace=True)

while the other is not (note that DataFrame.from_csv was deprecated in pandas 0.21 and removed in 1.0):

pd.DataFrame.from_csv("whitespace.csv")

Neither is out-of-the-box flexible regarding trailing white space, see the answers with regular expressions. Avoid delim_whitespace, as it also allows just spaces (without , or \t) as separators.
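The difference between leading and trailing whitespace can be seen in a small sketch. The sample data and the variable names are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample: a space before each comma (trailing whitespace)
# and two spaces after it (leading whitespace).
raw = "a ,  b\n1 ,  2\n"

# Default behaviour: whitespace on both sides of the comma is kept.
strict = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that follows the delimiter,
# but whitespace before the delimiter remains part of the field.
loose = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(strict.columns))
print(list(loose.columns))
```

The trailing space in the first column name survives in both cases, which is where a regular-expression separator or a later call such as str.strip() on the column names becomes useful.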