How do I make the Python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like this:

6Â 918Â 417Â 712

The straightforward way to trim this string (as I understand Python) is, with the string in a variable called s, to write:

s.replace('Â ', '')

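That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I have never quite understood how to switch between different encodings.

Here is the code. It is the same as above, but now in context. The file is saved as UTF-8 in Notepad and has the following header: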

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace.

s.replace(u'Â ', '')              # u before string is important

and save your .py file in a Unicode encoding (UTF-8, to match the coding declaration).

>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
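
For readers on Python 3, here is a minimal sketch of the same idea. There, encode returns bytes, so a decode step is added to get text back. The helper name to_ascii is made up for illustration.

```python
def to_ascii(text, errors="ignore"):
    """Drop (or replace) every character that has no ASCII encoding."""
    # str.encode returns bytes in Python 3; decode to get a str again.
    return text.encode("ascii", errors).decode("ascii")


print(to_ascii("hello aåbäcö"))             # non-ASCII characters are dropped
print(to_ascii("hello aåbäcö", "replace"))  # each one becomes '?'
```

Passing "replace" instead of "ignore" keeps the string length intact, which can help when positions in the text matter.
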
#!/usr/bin/env python
# -*- coding: utf-8 -*-


s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "")
print s

This will print out 6 918 417 712

Python 2 uses ASCII as the default encoding for source files, so to use non-ASCII characters in literals you must declare another encoding at the top of the file. Python 3 uses UTF-8 as the default encoding for source files, so this is less of an issue.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable utf-8 source encoding, this would go in one of the top two lines:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use quotes. In Python 2, you can use from __future__ import unicode_literals to get the Python 3 behavior, but be aware that this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string: when s is a byte string holding non-ASCII bytes, Python 2 tries to decode it as ASCII and raises UnicodeDecodeError.

  • string.replace returns a new string and does not edit in place, so make sure you use the return value as well.
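
The points above might look like this in Python 3. It is only a sketch: the sample value is an assumption, chosen to contain a NO-BREAK SPACE between the digit groups.

```python
# Python 3: source files default to UTF-8 and every str literal is Unicode.
s = "6\u00a0918\u00a0417\u00a0712"  # assumed sample: digits separated by NO-BREAK SPACE

s.replace("\u00a0", "")             # result discarded, so s is unchanged
unchanged = s

cleaned = s.replace("\u00a0", "")   # keep the return value instead
print(cleaned)
```
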

Throw out all characters that can't be interpreted as ASCII:

def remove_non_ascii(s):
    return "".join(c for c in s if ord(c)<128)

Keep in mind that this is guaranteed to work when s is a UTF-8 encoded byte string (because all bytes in multi-byte characters have the highest bit set to 1).
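
The high-bit property can be checked directly. Below is a small Python 3 sketch (where iterating over bytes yields integers); the sample text is an assumption.

```python
text = "6\u00a0918 R\u00e9sultat"     # assumed sample text
encoded = text.encode("utf-8")

# Every byte belonging to a multi-byte UTF-8 sequence is >= 0x80,
# so filtering on the high bit never leaves half a character behind.
for ch in text:
    raw = ch.encode("utf-8")
    if len(raw) > 1:
        assert all(b >= 0x80 for b in raw)

ascii_only = bytes(b for b in encoded if b < 0x80).decode("ascii")
print(ascii_only)
```
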

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i

Using Regex:

import re


strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
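
If the goal is only to drop everything outside ASCII, a much shorter pattern may be enough. This Python 3 sketch is a different pattern from the one above, not a rewrite of it (it deletes non-ASCII whitespace too), and the names NON_ASCII and strip_non_ascii are made up for illustration.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def strip_non_ascii(text):
    """Return text with every run of non-ASCII characters removed."""
    return NON_ASCII.sub("", text)


# Sample: each digit group is followed by 'Â' (U+00C2) and NO-BREAK SPACE (U+00A0).
print(strip_non_ascii("6\u00c2\u00a0918\u00c2\u00a0417\u00c2\u00a0712"))
```
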

I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes of 128 and above (or others if necessary).

Usage : str.translate(table[, deletechars])

>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )


>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.

With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely : unicode_string.encode("ascii", "ignore")
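
For illustration, a dict-based table might look like this in Python 3, where every str is a unicode string. The two entries are assumptions picked to match the string from the question.

```python
# Keys are code points (ord values); a value of None deletes the character.
table = {
    0x00A0: " ",   # NO-BREAK SPACE -> ordinary space
    0x00C2: None,  # 'Â' -> deleted
}

print("6\u00c2\u00a0918".translate(table))
```

Characters that are not listed in the dict pass through unchanged.
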

As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:

trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)

The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
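
A rough Python 3 sketch of that idea, using str.maketrans to build the table. The mapping is deliberately tiny and hypothetical, and the names ACCENT_TABLE and fold_accents are made up for illustration.

```python
# Hypothetical, deliberately small mapping; a real one would cover far more characters.
ACCENT_TABLE = str.maketrans("àâäéèêëîïôöùûüç", "aaaeeeeiioouuuc")


def fold_accents(text):
    """Replace the accented characters listed above with unaccented ones."""
    return text.translate(ACCENT_TABLE)


print(fold_accents("Résultat"))
```
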

For what it's worth, my character set was UTF-8 and I had included the classic "# -*- coding: utf-8 -*-" line.

However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.

My text had two words, separated by "\r\n". I was only splitting on the \n and replacing the "\n".

Once I looped through and saw the character set in question, I realized the mistake.

So, it could also be within the ASCII character set, but a character that you didn't expect.
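
A quick way to spot this kind of surprise is to look at the repr of the text. The Python 3 sketch below uses a made-up sample with a Windows line ending.

```python
text = "first\r\nsecond"       # assumed sample with a Windows line ending

print(repr(text))              # repr makes the hidden '\r' visible
naive = text.split("\n")       # leaves a trailing '\r' on the first word
robust = text.splitlines()     # treats '\r\n' as a single line break
print(naive, robust)
```
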

Way too late for an answer, but the original string was in UTF-8 and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'). When the bytes are decoded incorrectly as Windows-1252 or latin-1, \xc2 displays as Â and \xa0 displays as a space:

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712

The following code will replace all non ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

My two cents, with Beautiful Soup:

string='<span style="width: 0px> dirty text begin ( ĀĒēāæśḍṣ <0xa0> ) dtext end </span></span>'
string=string.encode().decode('ascii',errors='ignore')
print(string)

will give

<span style="width: 0px> dirty text begin (   ) dtext end </span></span>