Python: skipping comment lines marked with # in csv.DictReader

csv.DictReader handles CSV files very well, but I have some CSV files with comment lines (marked by a hash at the start of the line), for example:

# step size=1.61853
val0,val1,val2,hybridisation,temp,smattr
0.206895,0.797923,0.202077,0.631199,0.368801,0.311052,0.688948,0.597237,0.402763
-169.32,1,1.61853,2.04069e-92,1,0.000906546,0.999093,0.241356,0.758644,0.202382
# adaptation finished

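The csv module doesn't include any way to skip such lines.

I could easily do something hacky, but I imagine there's a nice way to wrap csv.DictReader around some other iterator object that preprocesses the input to discard those lines.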

Actually this works nicely with filter:

import csv
fp = open('samples.csv')
rdr = csv.DictReader(filter(lambda row: row[0]!='#', fp))
for row in rdr:
    print(row)
fp.close()

Good question. Python's CSV library lacks basic support for comments (not uncommon at the top of CSV files). While Dan Stowell's solution works for the specific case of the OP, it is limited in that # must appear as the first symbol. A more generic solution would be:

import csv

def decomment(csvfile):
    for row in csvfile:
        # keep only the text before the first '#'
        raw = row.split('#')[0].strip()
        if raw: yield raw


with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)

As an example, the following dummy.csv file:

# comment
# comment
a,b,c # comment
1,2,3
10,20,30
# comment

returns

['a', 'b', 'c']
['1', '2', '3']
['10', '20', '30']

Of course, this works just as well with csv.DictReader().
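
To make that concrete, here is a minimal sketch that feeds the same decomment generator into csv.DictReader. The data is held in an io.StringIO purely as an assumption so the snippet runs without a file on disk; any iterable of lines would do.

```python
import csv
import io


def decomment(lines):
    """Yield each line with any '#' comment removed, skipping empty results."""
    for line in lines:
        raw = line.split('#')[0].strip()
        if raw:
            yield raw


# Illustrative in-memory stand-in for dummy.csv
sample = io.StringIO(
    "# comment\n"
    "a,b,c # comment\n"
    "1,2,3\n"
    "10,20,30\n"
    "# comment\n"
)

# The first surviving line ("a,b,c") becomes the field names
rows = [dict(row) for row in csv.DictReader(decomment(sample))]
for row in rows:
    print(row)
```

Each row comes back as a mapping keyed by a, b and c, with the comment lines never reaching the reader.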

Another way to read a CSV file with comments is to use pandas.

Here's some sample code:

import pandas as pd

df = pd.read_csv('test.csv',
                 sep=',',              # field separator
                 comment='#',          # ignore everything after '#'
                 index_col=0,          # number or label of index column
                 skipinitialspace=True,
                 skip_blank_lines=True,
                 on_bad_lines='warn',  # pandas >= 1.3; replaces error_bad_lines/warn_bad_lines
                 ).sort_index()
print(df)
df.fillna('no value', inplace=True)  # replace NaN with 'no value'
print(df)

For this csv file:

a,b,c,d,e
1,,16,,55#,,65##77
8,77,77,,16#86,18#
#This is a comment
13,19,25,28,82

we will get this output:

       b   c     d   e
a
1    NaN  16   NaN  55
8   77.0  77   NaN  16
13  19.0  25  28.0  82
           b   c         d   e
a
1   no value  16  no value  55
8         77  77  no value  16
13        19  25        28  82

Just posting a bugfix for @sigvaldm's solution.

import csv

def decomment(csvfile):
    for row in csvfile:
        # use the split only as a test; yield the original line untouched
        raw = row.split('#')[0].strip()
        if raw: yield row


with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)

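A CSV line can contain "#" characters inside quoted strings and still be perfectly valid. The previous solution cut off any field containing a "#" character. This version uses the split only to decide whether a line has content before the "#", then yields the original line unchanged (so a trailing comment after the data is kept as well).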
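
A small sketch of the difference, using a made-up line whose quoted field contains a "#". The names split_decomment and keep_row_decomment are hypothetical labels for the two variants, not names from the answers.

```python
import csv


def split_decomment(lines):
    # Earlier variant: yields only the text before the first '#'
    for line in lines:
        raw = line.split('#')[0].strip()
        if raw:
            yield raw


def keep_row_decomment(lines):
    # Bugfix variant: same test, but yields the original line
    for line in lines:
        raw = line.split('#')[0].strip()
        if raw:
            yield line


# Illustrative input: one comment line, a header, one row with a quoted '#'
lines = ['# comment\n', 'id,tag\n', '1,"issue #42"\n']

truncated_lines = list(split_decomment(lines))
intact_rows = list(csv.reader(keep_row_decomment(lines)))

print(truncated_lines)  # the quoted field is cut at the '#'
print(intact_rows)      # the quoted field survives parsing
```

The first variant hands the csv module a line with an unterminated quote, while the second leaves quoting to the parser.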

Based on the answers from sigvaldm and Leonid:

import csv

def is_comment(line):
    return line.startswith('#')


def is_whitespace(line):
    return line.isspace()


def decomment(csvfile):
    for row in csvfile:
        # skip whole-line comments and blank lines; pass everything else through
        if not is_comment(row) and not is_whitespace(row):
            yield row


with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)
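
Since the question asked about csv.DictReader specifically, here is a hedged sketch of the same whole-line filter wrapped around it. Two things are assumptions rather than part of the answer above: the lstrip() call, which also treats indented "#" lines as comments, and the shortened three-column sample data held in an io.StringIO.

```python
import csv
import io


def decomment(lines):
    """Yield only lines that are neither blank nor whole-line '#' comments."""
    for line in lines:
        stripped = line.lstrip()  # assumption: tolerate indentation before '#'
        if stripped and not stripped.startswith('#'):
            yield line


# Illustrative data modelled on the question's file, shortened to three columns
sample = io.StringIO(
    "# step size=1.61853\n"
    "val0,val1,val2\n"
    "0.206895,0.797923,0.202077\n"
    "\n"
    "  # adaptation finished\n"
)

records = [dict(row) for row in csv.DictReader(decomment(sample))]
print(records)
```

Because the filter never modifies a data line, "#" characters inside quoted fields are left for the csv module to handle.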