小开

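Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, 0xe9 would be the character é.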
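
To make the point concrete, here is a minimal sketch that decodes that single byte both ways. The variable names are illustrative only and do not come from the question.

```python
# The single byte 0xE9 from the error message.
raw = b"\xe9"

# Windows-1252 (codec name "cp1252" in Python) maps 0xE9 directly to "é" (U+00E9).
as_cp1252 = raw.decode("cp1252")
print(as_cp1252 == "\u00e9")

# In UTF-8, 0xE9 is the lead byte of a three-byte sequence,
# so on its own it cannot be decoded.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("UTF-8 decoding failed:", exc.reason)
```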

小开

Best answer

As suggested by Mark Ransom, I found the right encoding for this problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding="ISO-8859-1") will solve the problem.

小开

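Here is an example of converting a CSV file in Python 3:

import csv
from sys import argv

try:
    # Decode the input as ISO-8859-1 instead of the default encoding
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',', quotechar='"')
except IOError:
    pass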
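
A slightly fuller sketch of the same idea, assuming the goal is to rewrite the file as UTF-8. The function name convert_csv and the output path are hypothetical and not part of the answer above.

```python
import csv


def convert_csv(src_path, dst_path, src_encoding="ISO-8859-1"):
    """Read a CSV in src_encoding, write it back out as UTF-8, return the row count."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, delimiter=",", quotechar='"')
        writer = csv.writer(dst, delimiter=",", quotechar='"')
        rows = 0
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Using with blocks closes both files even if reading fails partway through, which the bare open() call does not guarantee.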

小开

Try reading it with pandas:

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

小开

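If you are using Python 2, the following is the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
    pass

This is needed because the built-in open() in Python 2 does not accept an encoding parameter; if you pass one, you get the following error:

TypeError: 'encoding' is an invalid keyword argument for this function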

小开

The following also worked for me. ISO 8859-1 will save you a lot of trouble, mainly if you are using speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")

小开

You can sometimes get the same error when calling open(filepath) where filepath is not actually a file, so first make sure the file you are trying to open exists:

import os
assert os.path.isfile(filepath)

小开

You can work around the problem like this:

for line in open(your_file_path, 'rb'):

'rb' reads the file in binary mode, so each line is a bytes object and no decoding takes place.
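
Since binary mode yields bytes rather than str, each line still has to be decoded somewhere. A minimal sketch, assuming the file is mostly UTF-8 with the occasional ISO-8859-1 line; the helper name read_lines_lenient is made up for illustration.

```python
def read_lines_lenient(path, primary="utf-8", fallback="ISO-8859-1"):
    """Yield text lines, decoding each with `primary` and falling back if that fails."""
    with open(path, "rb") as handle:
        for raw_line in handle:  # raw_line is bytes, not str
            try:
                yield raw_line.decode(primary)
            except UnicodeDecodeError:
                # ISO-8859-1 maps every byte to a character, so this cannot raise.
                yield raw_line.decode(fallback)
```

Falling back per line keeps valid UTF-8 lines intact, but the fallback result is only right if the odd lines really are ISO-8859-1.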

小开

Like this:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")

小开

You can try this approach:

open('u.item', encoding='utf8', errors='ignore')
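
Note that errors='ignore' does not repair anything: it silently drops the bytes that cannot be decoded. The sketch below compares it with errors='replace' and with decoding under the real encoding. The sample bytes are invented for illustration.

```python
raw = b"caf\xe9 au lait"  # "café au lait" encoded as ISO-8859-1

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD
correct = raw.decode("ISO-8859-1")                # decoded with the real encoding

# ascii() keeps the output readable on any terminal.
print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```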

小开

Open your file in Notepad++ and use the "Encoding" (or "Encodage") menu to identify the encoding, or to convert from ANSI to UTF-8 or to the ISO 8859-1 code page.

小开

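So that this page turns up faster in Google searches for similar questions (about UTF-8 errors), I am leaving my solution here for others.

I had a problem opening a .csv file, with this description: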

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

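I opened the file in Notepad and counted to the 150th position: it was a Cyrillic character. I re-saved the file using the "Save as..." command with encoding "UTF-8", and my program started to work.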
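
The position in the error message is a byte offset, so the offending byte can also be inspected from Python instead of counting characters in an editor. A small sketch; the helper name show_bytes_around and the default window size are illustrative choices.

```python
def show_bytes_around(path, position, window=10):
    """Return the raw bytes surrounding `position` (a byte offset) in the file."""
    start = max(0, position - window)
    with open(path, "rb") as handle:
        handle.seek(start)
        # Read `window` bytes before the offset, the byte itself, and `window` after.
        return handle.read(position - start + window + 1)
```

For example, print(show_bytes_around("file.csv", 150)) shows the bytes next to offset 150 (file.csv is a placeholder name), with non-ASCII bytes displayed as escapes such as \xe9.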

小开

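Based on another question on Stack Overflow and the previous answers in this post, I would like to add a helper for finding the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a Python script that does this for you: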

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)


def find_encoding(fname):
    """Find the encoding of a file using file command
    """

    # find fullname of file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None

    file_cmd = which_run.stdout.decode().replace('\n', '')

    # run file command to get MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        # nothing useful to parse if the command failed
        return None

    # return encoding name only
    return file_run.stdout.decode().split()[1]


# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
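
Where the file command is not available (for example on Windows), a rough fallback is to try a short list of candidate encodings in order. This is only a heuristic sketch, not a real detector: the function name guess_encoding and the candidate list are assumptions. Because ISO-8859-1 accepts any byte sequence, it has to come last, and a match on it proves nothing.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(path, "rb") as handle:
        data = handle.read()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next one
        return name
    return None
```

A successful decode only means the bytes are valid in that encoding, not that the resulting text is what the author intended, so the result is worth checking by eye.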

小开

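I am using a dataset downloaded from Kaggle, and while reading the dataset it threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I solved it.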

import pandas as pd


pd.read_csv('top50.csv', encoding='ISO-8859-1')

小开

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)

小开

Use this if you are loading data directly from GitHub or Kaggle: DF = pd.read_csv(file, encoding='ISO-8859-1')

小开

'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error is caused by the encoding.

Solution: use encoding='latin-1'

Reference: https://pandas.pydata.org/docs/search.html?q=encoding

小开

I keep running into this error, and usually it is not fixed by encoding='utf-8' but by engine='python', like this:

import pandas as pd


file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df

The link to the documentation is here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

小开

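In my case, this problem occurred because I had changed the extension of an Excel file (.xlsx) directly to .csv.

The solution was to open the file and save it as a new .csv file (i.e. File -> Save As -> select the .csv extension and save it). This worked for me.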
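
A renamed workbook can be spotted before parsing, because .xlsx files are ZIP archives while real CSV files are plain text. The following is a heuristic sketch; the function name looks_like_xlsx is hypothetical, and the check matches any Office Open XML package (for instance .docx as well), not only spreadsheets.

```python
import zipfile


def looks_like_xlsx(path):
    """Heuristic: True if `path` is a ZIP archive containing an OOXML content-types part."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Office Open XML packages carry this part at the archive root.
        return "[Content_Types].xml" in archive.namelist()
```

If this returns True for a file ending in .csv, the file needs to be exported as CSV from the spreadsheet application rather than read with a different encoding.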

小开

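My problem was similar, in that UTF-8 text was being passed to the Python script.

In my case, it came from SQL using sp_execute_external_script in SQL Server Machine Learning Services. For whatever reason, VARCHAR data appears to be passed as UTF-8, whereas NVARCHAR data is passed as UTF-16.

Since there is no way to specify the default encoding in Python, and no user-editable Python statement parses the data, I had to use the SQL CONVERT() function in the SELECT query in the @input_data parameter.

So, while this query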

EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));

gives the error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data

using CONVERT(type, data) (CAST(data AS type) would also work)

EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));

returns

id  text
1   Ç
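
The error message can be reproduced in plain Python from a lone 0xC7 byte, which is how Ç is stored in single-byte code pages such as ISO-8859-1. Treating the VARCHAR value as such a single byte is an assumption made for this sketch; the actual collation used by the server is not stated above.

```python
raw = b"\xc7"  # assumed single-byte representation of "Ç"

# In UTF-8, 0xC7 announces a two-byte sequence, so a lone byte is incomplete.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)

# The same byte decodes cleanly under ISO-8859-1.
as_latin1 = raw.decode("ISO-8859-1")

# As UTF-16 (little-endian), the same character takes two bytes.
as_utf16 = "\u00c7".encode("utf-16-le")
print(as_utf16)
```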
