使用 Unicode 的 Python 和正则表达式

我需要从字符串‘’中删除一些 Unicode 符号

我知道它们肯定存在，我试过了:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')

但是它不起作用，字符串保持不变，我做错了什么？

102481

小开

最佳答案

Are you using python 2.x or 3.0?

If you're using 2.x, try making the regex string a unicode-escape string, with 'u'. Since it's regex it's good practice to make your regex string a raw string, with 'r'. Also, putting your entire pattern in parentheses is superfluous.

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects character class aliases like \w or \b, of which this pattern does not use any and so would not be affected by.

>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', re.UNICODE) >>> myre <_sre.SRE_Pattern object at 0xb20b378> >>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ' >>> result = myre.sub('', mystr) >>> len(mystr), len(result) (38, 22) >>> print result بسم الله الرحمن الرحيم