Convert Unicode to ASCII without errors in Python

小开

>>> u'aあä'.encode('ascii', 'ignore')
'a'

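Decode the string you get back, using either the charset in the appropriate meta tag in the response or the one in the Content-Type header, then encode it.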
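As a rough Python 3 sketch of that decode-then-encode step: the helper name decode_body, the sample bytes, and the fallback of utf-8 are assumptions made for illustration, not part of the original answer. The charset is pulled out of a Content-Type value with the standard library's email.message.Message.

```python
from email.message import Message


def decode_body(raw_bytes, content_type, fallback="utf-8"):
    """Decode raw_bytes using the charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or fallback
    return raw_bytes.decode(charset)


# Hypothetical response body and header value
raw = b"caf\xe9\xa0au lait"
text = decode_body(raw, "text/html; charset=windows-1252")

# Only now, with a real text string in hand, encode to ASCII
ascii_bytes = text.encode("ascii", "ignore")
print(ascii_bytes)
```

With a live urllib response in Python 3, response.headers.get_content_charset() should give the same information, since the headers object is a Message subclass.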

The encode method accepts custom handlers for errors. Besides ignore, the built-in values are:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
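If none of the built-in handlers fits, a custom one can be registered with codecs.register_error. The sketch below is only an illustration: the handler name "underscore" and the choice of replacement character are made up for this example.

```python
import codecs


def underscore_handler(err):
    """Replace every unencodable character with an underscore."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # Return the replacement text and the position at which to resume encoding
    return ("_" * (err.end - err.start), err.end)


# "underscore" is a hypothetical handler name chosen for this example
codecs.register_error("underscore", underscore_handler)

result = "aあä".encode("ascii", "underscore")
print(result)
```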

小开

If you have a string, say line, you can use the string's .encode([encoding], [errors='strict']) method to convert encoding types.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and Unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html
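To make the role of the errors argument concrete, here is a small Python 3 sketch. The sample string is invented; the point is that the default 'strict' raises, while 'ignore' silently drops what it cannot encode.

```python
# Hypothetical input containing one non-ASCII character (an en dash)
line = "my big string \u2013 with a dash"

try:
    line.encode("ascii")  # errors defaults to 'strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' drops the en dash instead of raising
ignored = line.encode("ascii", "ignore")
print(strict_failed, ignored)
```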

小开

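I use this helper function in all of my projects. If it can't convert the Unicode, it ignores it. This ties into a Django library, but with a little research you could bypass it.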

from django.utils import encoding


def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

After using this, I no longer get any Unicode errors.
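For readers who want the same effect without Django, a standard-library-only sketch might look like the following. The name to_ascii_str and the source_encoding parameter are hypothetical, and this only aims to reproduce the behaviour shown in the docstring above under Python 3; it is not a drop-in replacement for smart_str.

```python
def to_ascii_str(value, source_encoding="utf-8"):
    """Return value as a str containing only ASCII characters."""
    if isinstance(value, bytes):
        # Assumption: incoming bytes are in source_encoding; bad bytes are dropped
        value = value.decode(source_encoding, errors="ignore")
    # Encode to ASCII, dropping anything that does not fit, then go back to str
    return str(value).encode("ascii", errors="ignore").decode("ascii")


print(to_ascii_str("ni\xf1era"))
```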

小开

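You wrote "I assume this means the HTML contains some wrongly-formed unicode attempt somewhere."

The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front. Look for "charset".

You appear to be assuming that the charset is UTF-8. On what grounds? The "\xa0" byte shown in your error message indicates that you may have a single-byte character set, e.g. cp1252.

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question: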

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.


html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

小开

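Best answer

2018 update:

As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode of a gzipped response, as in the original answer, you'll get an error like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):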

import gzip
import io

Note: In Python 2 you'd use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

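This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer using the GzipFile class. After that, the gzipped file can be read back into bytes and finally decoded to normal readable text.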
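The 0x8b at position 1 in the error above is the second byte of the gzip magic number (0x1f 0x8b), which suggests a simple check before decoding. The following is a minimal sketch under that assumption; the helper name body_to_text is invented, and it uses gzip.decompress rather than the GzipFile approach from the answer.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def body_to_text(raw, encoding="utf-8"):
    """Decompress raw if it looks gzipped, then decode it to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Simulate a gzipped response body instead of fetching one over the network
compressed = gzip.compress("h\u00e9llo".encode("utf-8"))
print(body_to_text(compressed))
print(body_to_text(b"plain text"))
```

In practice the Content-Encoding response header is the more reliable signal; sniffing the magic bytes is just a fallback that works without access to the headers.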

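Original answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem when we try to .encode() an already encoded byte string. So you might try to decode it first, as in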

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

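succeeds without error. Do note that "windows-1252" is something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a string that is 1 character long, what do you expect?) You should change it to whatever encoding actually applies to the byte string returned from .urlopen().read() for the content you retrieved.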
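The examples above are Python 2. In Python 3 the distinction is enforced by the types, which is sketched below: bytes objects have no .encode() method at all, so the implicit ASCII decode cannot happen. "windows-1252" is again just the example encoding.

```python
html = b"\xa0"  # raw bytes, as read() returns them in Python 3

# bytes has decode() but no encode(), so the Python 2 mistake is impossible here
has_encode = hasattr(html, "encode")

decoded_str = html.decode("windows-1252")    # bytes -> str
encoded_bytes = decoded_str.encode("utf-8")  # str -> bytes
print(has_encode, repr(decoded_str), encoded_bytes)
```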

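Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html), as html is not the encoded string from html.encode (if that is what you were originally aiming for).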
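A tiny sketch of that point, with an invented sample string: strings are immutable, so the result of .encode() has to be assigned to something or it is lost.

```python
html = "caf\u00e9"

# The return value is discarded; html itself is left untouched
html.encode("ascii", "ignore")
unchanged = html

# Keep the result by binding it to a name
html_ascii = html.encode("ascii", "ignore")
print(unchanged, html_ascii)
```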

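As Ignacio suggested, check the source webpage for the actual encoding of the string returned from read(). It's either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().

Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)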
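One way to cope with declarations that do not match the content is to treat the declared charset as a first guess only. The sketch below is an assumption-laden illustration: the function name, the fallback list, and the final latin-1 catch-all are choices made for this example. Note that a decode that succeeds is not proof that the encoding is right.

```python
def decode_with_fallback(raw, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Try the declared charset first, then a few common fallbacks.

    Returns a (text, encoding_used) tuple.
    """
    candidates = [declared, *fallbacks] if declared else list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate
            continue
    # latin-1 maps every byte to a code point, so this never raises
    return raw.decode("latin-1"), "latin-1"


# A page that claims UTF-8 but actually contains a single-byte 0xa0
text, used = decode_with_fallback(b"\xa0abc", declared="utf-8")
print(repr(text), used)
```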

小开

As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

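You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.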

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details, see the question "Where is Python's 'best ASCII for this Unicode' database?"
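One such approach, sketched here for Python 3, is to run the text through str.translate with a small mapping table before normalizing. The table below is a hand-picked assumption covering a few common quotes and dashes, not a complete database, and to_ascii is a made-up name.

```python
import unicodedata

# Hypothetical, deliberately small table: code point -> ASCII replacement
PUNCTUATION = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
}


def to_ascii(text):
    """Map known punctuation, strip accents, and drop whatever is left."""
    text = text.translate(PUNCTUATION)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s na\u00efve \u2013 r\u00e9sum\u00e9"))
```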

小开

For broken consoles like cmd.exe, and for HTML output, you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This preserves all the non-ASCII characters while making them printable in pure ASCII and in HTML.
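A short Python 3 sketch of why nothing is lost: the numeric character references produced by xmlcharrefreplace can be turned back into the original text with html.unescape from the standard library. The sample string is the one used earlier in this thread.

```python
import html

original = "aあä"

# Non-ASCII characters become numeric character references
encoded = original.encode("ascii", "xmlcharrefreplace")

# html.unescape reverses the references, so the round trip is lossless
restored = html.unescape(encoded.decode("ascii"))
print(encoded, restored)
```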

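WARNING: If you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases are printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').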

小开

unicodestring = '\xa0'


decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me.

小开

Use unidecode: it even converts weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.

$ pip install unidecode

Then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

小开

It looks like you are using Python 2.x. Python 2.x defaults to ASCII and doesn't know about Unicode. Hence the exception.

Just paste the line below after the shebang and it will work

# -*- coding: utf-8 -*-

小开

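I think the answer is there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file with data in the following form (containing ASCII and non-ASCII characters)

1/10/17, 21:36 - Land: Welcome

and we want to drop everything else and keep only the ASCII characters.

This code will do it: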

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>>> type(rline)
<type 'str'>
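The snippet above is Python 2 (unicode() and the print statement). A rough Python 3 counterpart is sketched below; the function name ascii_lines and the temporary sample file are assumptions for the sake of a self-contained example. Passing encoding= to open() does the decoding that unicode(rline, "utf-8") did before.

```python
import os
import tempfile
import unicodedata


def ascii_lines(path, encoding="utf-8"):
    """Yield each non-empty line of the file reduced to ASCII characters."""
    with open(path, encoding=encoding) as fp:
        for line in fp:
            normalized = unicodedata.normalize("NFKD", line)
            ascii_line = normalized.encode("ascii", "ignore").decode("ascii").strip()
            if ascii_line:
                yield ascii_line


# Write a small hypothetical sample file so the example can run on its own
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("1/10/17, 21:36 - Land: Welcome \u2603\n\u2603\nCaf\u00e9\n")
    sample_path = tmp.name

result = list(ascii_lines(sample_path))
os.remove(sample_path)
print(result)
```

In Python 3 the yielded values are str objects rather than byte strings, because of the final .decode("ascii").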

小开

You can use the following code as an example to avoid Unicode to ASCII errors:

from anyascii import anyascii


content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)