"Line contains NULL byte" in CSV reader (Python)

小开

I'm guessing you have a NUL byte in input.csv. You can test that with

with open('input.csv', 'rb') as f:
    if b'\0' in f.read():
        print("you have null bytes in your input file")
    else:
        print("you don't")

If you do,

reader = csv.reader(x.replace('\0', '') for x in mycsv)

may get you around that. Or it may indicate that the .csv file is UTF-16 or something else 'interesting'.
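To tell a stray NUL apart from a UTF-16 file, it can help to look at the raw bytes before parsing. The sketch below is one way to do that. `diagnose_nul_bytes` is a hypothetical helper name, and it recognises UTF-16 only by its byte order mark, so a UTF-16 file without a BOM would be reported as plain NUL bytes.

```python
import codecs


def diagnose_nul_bytes(path):
    """Return 'utf-16', 'nul-bytes' or 'clean' for the file at `path`.

    Hypothetical helper: UTF-16 is detected only through its byte order mark.
    """
    with open(path, 'rb') as f:
        data = f.read()
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'      # the NULs are part of the encoding
    if b'\0' in data:
        return 'nul-bytes'   # stray NULs in an otherwise 8-bit file
    return 'clean'
```

If the result is 'utf-16', decoding the file with that encoding is likely a better fix than stripping the NULs.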

小开

This will tell you what line is the problem.

import csv


lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])


with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        try:
            for row in reader:
                if row[0] not in lines:
                    writer.writerow(row)
        except csv.Error:
            # reader.line_num is the number of physical lines read so far
            print('csv choked on line %s' % reader.line_num)
            raise

Perhaps this from daniweb would be helpful:

I'm getting this error when reading from a csv file: "Runtime Error! line contains NULL byte". Any idea about the root cause of this error?

...

OK, I got it and thought I'd post the solution. Simple, yet it caused me grief... The file I used was saved in .xls format instead of .csv. I didn't catch this because the file name itself had the .csv extension while the type was still .xls.
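A mislabelled spreadsheet like this can usually be spotted from its first few bytes. The following is a minimal sketch, assuming only two well-known signatures are of interest; `SIGNATURES` and `sniff_container` are hypothetical names, not part of any library.

```python
# Leading bytes ("magic numbers") of formats that are often mislabelled as .csv.
SIGNATURES = {
    b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1': 'OLE2 container (e.g. legacy .xls)',
    b'PK\x03\x04': 'ZIP container (e.g. .xlsx)',
}


def sniff_container(path):
    """Return a description if the file starts with a known signature, else None."""
    with open(path, 'rb') as f:
        head = f.read(8)
    for magic, description in SIGNATURES.items():
        if head.startswith(magic):
            return description
    return None
```

A return value of None only means that neither signature matched; it does not prove the file is valid CSV.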

小开

Best answer

I've solved a similar problem with an easier solution:

import codecs
import csv
csvReader = csv.reader(codecs.open('file.csv', 'r', 'utf-16'))

The key was using the codecs module to open the file with the UTF-16 encoding. Many other encodings are available; check the documentation.
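On Python 3 the built-in open() accepts an encoding argument, so the same idea can be written without codecs. This is a small sketch rather than the answerer's own code; `read_utf16_csv` is a hypothetical name, and the file is assumed to be UTF-16 with a byte order mark.

```python
import csv


def read_utf16_csv(path):
    """Read a UTF-16 encoded CSV file and return its rows as lists of strings."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, newline='', encoding='utf-16') as f:
        return list(csv.reader(f))
```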

小开

You could just inline a generator to filter out the NUL bytes if you want to pretend they don't exist. Of course, this assumes the NUL bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

See the (line.replace('\0','') for line in f) below. On Python 2 you'll probably also want to open that file in mode rb; on Python 3 the csv module needs text, so open it in text mode with newline='' instead.

import csv


lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])


with open('corrected.csv','w', newline='') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    # Python 3: text mode, so that line.replace() works on str
    with open('input.csv', 'r', newline='') as mycsv:
        reader = csv.reader( (line.replace('\0','') for line in mycsv) )
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)

小开

I recently fixed this issue; in my case I was trying to read a file that was compressed. Check the file format first, then check that the contents match what the extension claims.
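If the compressed file turns out to be gzip, it can be read without unpacking it first. The sketch below assumes gzip is the only compression of interest and detects it by its two-byte signature; `read_maybe_gzipped_csv` and `GZIP_MAGIC` are hypothetical names.

```python
import csv
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def read_maybe_gzipped_csv(path, encoding='utf-8'):
    """Return the rows of a CSV file that may or may not be gzip-compressed."""
    with open(path, 'rb') as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    # gzip.open and open both accept text mode, encoding and newline.
    opener = gzip.open if is_gzip else open
    with opener(path, 'rt', encoding=encoding, newline='') as f:
        return list(csv.reader(f))
```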

小开

Turning my Linux environment into a clean, complete UTF-8 environment did the trick for me. Try the following on your command line:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
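To see whether those variables changed anything from Python's point of view, you can print the encodings the interpreter picked up from the environment. This is only a diagnostic sketch; `describe_encodings` is a hypothetical name. On most current Python 3 versions, open() without an encoding argument falls back to the locale's preferred encoding, which is why the environment can matter here.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derived from the current environment."""
    return {
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }
```

Passing encoding= explicitly to open() avoids depending on the environment at all.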

小开

A tricky way:

If you develop under Linux, you can use all the power of sed:

from subprocess import check_call, CalledProcessError


PATH_TO_FILE = '/home/user/some/path/to/file.csv'


try:
    check_call("sed -i -e 's|\\x0||g' {}".format(PATH_TO_FILE), shell=True)
except CalledProcessError as err:
    print(err)

This should be efficient for huge files, since sed streams the file instead of loading it into Python.

Tested with Python 3 on Kubuntu.
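Where sed is not available, a similar streaming cleanup can be sketched in pure Python. This is an alternative to the sed call, not a replacement the answer itself proposes; `strip_nul_bytes` and its `chunk_size` parameter are hypothetical. It reads the file in chunks, so memory use stays bounded.

```python
import os
import tempfile


def strip_nul_bytes(path, chunk_size=1024 * 1024):
    """Rewrite the file at `path` without NUL bytes, reading it in chunks."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as dst, open(path, 'rb') as src:
            for chunk in iter(lambda: src.read(chunk_size), b''):
                dst.write(chunk.replace(b'\0', b''))
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Note that the rewritten file takes on the permissions of the temporary file, and that this is only appropriate when the NULs are stray bytes rather than part of a UTF-16 encoding.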

小开

If you want to replace the nulls with something you can do this:

def fix_nulls(s):
    for line in s:
        yield line.replace('\0', ' ')


r = csv.reader(fix_nulls(open(...)))

小开

pandas.read_csv accepts an encoding argument, so it can read a UTF-16 file directly; the NUL bytes that trip up the csv module are then decoded as part of the encoding.

import pandas as pd
data = pd.read_csv(file, encoding='utf-16')

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

小开

This is long settled, but I ran across this answer because I was experiencing an unexpected error while reading a CSV to process as training data in Keras and TensorFlow.

In my case, the issue was much simpler and is worth being aware of. The data written to the CSV wasn't consistent, so some columns were missing entirely, which seems to end up throwing this error as well.

The lesson: If you're seeing this error, verify that your data looks the way that you think it does!
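One quick way to check that the data looks the way you think it does is to count how many fields each row has. The sketch below assumes a UTF-8 file; `field_count_histogram` is a hypothetical name.

```python
import csv
from collections import Counter


def field_count_histogram(path, encoding='utf-8'):
    """Map each field count to the number of rows that have that many fields."""
    with open(path, newline='', encoding=encoding) as f:
        return Counter(len(row) for row in csv.reader(f))
```

A consistent file yields a single key; more than one key points at rows with missing or extra columns.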

小开

It is very simple.

Don't create the CSV file via "create new Excel" or by saving as ".csv" from Windows.

Instead, import the csv module, write a dummy CSV file with it, and then paste your data into that.

A CSV file made by Python's csv module itself will no longer give you encoding or blank-line errors.
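Writing the file with the csv module might look like the sketch below. `write_rows` is a hypothetical helper, and UTF-8 is an assumed encoding. Opening the file with newline='' is what avoids the blank lines between rows on Windows.

```python
import csv


def write_rows(path, rows, encoding='utf-8'):
    """Write an iterable of rows to `path` using the csv module's defaults."""
    # newline='' stops Python from translating the '\r\n' that csv.writer emits.
    with open(path, 'w', newline='', encoding=encoding) as f:
        csv.writer(f).writerows(rows)
```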

小开

To skip the rows that contain NULL bytes:

import csv


with open('sample.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    while True:
        try:
            row = next(reader)
            print(row)
        except csv.Error:
            continue
        except StopIteration:
            break
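A variant that does not rely on the reader raising an error is to drop the offending lines before they reach csv.reader. This is a sketch under the assumption that no quoted field spans several physical lines, since filtering line by line would break such records; `rows_without_nul_lines` is a hypothetical name.

```python
import csv


def rows_without_nul_lines(path, encoding='utf-8'):
    """Yield parsed rows, dropping any physical line that contains a NUL."""
    with open(path, newline='', encoding=encoding) as f:
        clean_lines = (line for line in f if '\0' not in line)
        yield from csv.reader(clean_lines)
```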

小开

import csv


def fix_nulls(s):
    for line in s:
        yield line.replace('\0', '')


# csv_file is the path to your CSV file
with open(csv_file, 'r', encoding = "utf-8") as f:
    reader = csv.reader(fix_nulls(f))
    for line in reader:
        pass  # do something

This way works for me.