What is the best way to remove accents (normalize) in a Python Unicode string?


I found this answer on the web:

import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

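It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping the non-ASCII characters, because that fails for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as diacritics.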
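To make the Greek failure concrete, here is a minimal Python 3 sketch. The helper name `remove_accents_ascii` is made up for this illustration; it decodes the bytes back to `str` so the results are easy to compare.

```python
import unicodedata


def remove_accents_ascii(input_str):
    """Decompose with NFKD, then drop every non-ASCII code point."""
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # On Python 3, encode() returns bytes, so decode back to str.
    return nfkd_form.encode('ascii', 'ignore').decode('ascii')


french = remove_accents_ascii('café')   # only the accent is lost
greek = remove_accents_ascii('Δέλτα')   # no Greek letter is ASCII, so all are lost
print(repr(french), repr(greek))
```

Here `french` ends up as 'cafe', while `greek` is an empty string: the base letters are thrown away along with the accent.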

Edit: this does the trick:

import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

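unicodedata.combining(c) returns the canonical combining class of c as an integer. The value is non-zero (and therefore truthy) when c can combine with the preceding character, which mainly means it is a diacritic.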
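A quick way to see this, using only the standard library (the variable names are just for illustration):

```python
import unicodedata

acute = '\u0301'  # COMBINING ACUTE ACCENT

# combining() returns the canonical combining class as an int, not a bool.
print(unicodedata.combining(acute))  # non-zero for combining marks
print(unicodedata.combining('e'))    # 0 for ordinary base characters

# NFKD splits 'é' into a base letter followed by the combining mark.
decomposed = unicodedata.normalize('NFKD', 'é')
flags = [bool(unicodedata.combining(c)) for c in decomposed]
print(flags)
```

Filtering on `not unicodedata.combining(c)` therefore keeps the base letter and drops the mark that follows it.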

Edit 2: remove_accents expects a Unicode string, not a byte string. If you have a byte string, you must decode it into a Unicode string like this:

encoding = "utf-8"  # must match how the bytes were produced (could be iso-8859-15, cp1252, ...)
byte_string = b"caf\xc3\xa9"  # "café" encoded as UTF-8; before Python 3 a plain "café" literal is already a byte string
unicode_string = byte_string.decode(encoding)
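The choice of encoding matters. The following sketch (Python 3, standard library only) shows what happens when the codec does and does not match the bytes:

```python
utf8_bytes = 'café'.encode('utf-8')       # two bytes for 'é'
latin1_bytes = 'café'.encode('latin-1')   # one byte for 'é'

# Decoding with the matching codec recovers the text.
assert utf8_bytes.decode('utf-8') == 'café'
assert latin1_bytes.decode('latin-1') == 'café'

# Decoding with the wrong codec either raises ...
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# ... or silently produces mojibake.
mojibake = utf8_bytes.decode('latin-1')
print(decode_failed, mojibake)
```

If accents come out garbled after stripping, a mismatched decode step like the last one is a likely cause.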


Best answer

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

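The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).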
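As a rough comparison, the sketch below runs both filters side by side; the function names `strip_by_category` and `strip_by_combining` are invented for this example. On typical Latin and Greek text they agree, although the two tests are not equivalent for every code point (some Mn characters have combining class 0), so results can differ for other scripts.

```python
import unicodedata


def strip_by_category(s):
    """Drop code points whose general category is Mn (Nonspacing_Mark)."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def strip_by_combining(s):
    """Drop code points with a non-zero canonical combining class."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))


sample = 'A \u00c0 \u0394 \u038e François'
print(strip_by_category(sample))
print(strip_by_combining(sample))
```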

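And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts and so on are not "decoration".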
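A small illustration of that loss of information: several distinct French words collapse into the same string once the marks are gone. The word list is just an example.

```python
import unicodedata


def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


# Distinct French words that differ only in their accents.
words = ['cote', 'côte', 'côté', 'coté']
stripped = {strip_accents(w) for w in words}
print(stripped)
```

All four words map to 'cote', so the stripped form is fine as a search or sort key but should not replace the original text.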


Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Example:

>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'


This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud


def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

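This is the most elegant way I can think of (and it was mentioned by alexis in a comment on this page), although I don't think it is very elegant at all. In fact, as pointed out in the comments, it is more of a hack, since Unicode names are really just names; they give no guarantee of being consistent or anything.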

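There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode names do not contain "WITH". It depends on what you want to do anyway. I sometimes needed accent stripping to achieve dictionary sort order.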
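For whole strings, the per-character function can be mapped over the input. The sketch below is one possible variant rather than the answer's exact code: it passes a default to `ud.name` so that characters without a Unicode name (control characters, for instance) do not raise `ValueError`, and the wrapper `rmdiacritics_str` is a hypothetical helper.

```python
import unicodedata as ud


def rmdiacritics(char):
    """Return the base character by cutting the Unicode name at ' WITH '."""
    desc = ud.name(char, '')  # '' for characters without a name
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        try:
            return ud.lookup(desc[:cutoff])
        except KeyError:
            pass  # the shortened name does not exist
    return char


def rmdiacritics_str(text):
    # Hypothetical helper: apply the per-character function to a string.
    return ''.join(rmdiacritics(c) for c in text)


print(rmdiacritics_str('Łódź ø'))  # strokes and accents are both removed
print(rmdiacritics_str('ɐ'))       # a turned letter is left untouched
```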

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python 3 code).


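In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents), along with some strings that would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment: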

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])


with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3.)
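The code above is Python 2 only (`reload(sys)`, `sys.setdefaultencoding`, `unicode` and the `print` statement do not exist in Python 3). A possible Python 3 equivalent, sketched under the assumption that the file is UTF-8, needs none of those workarounds. The generator `read_unaccented` is a name invented here, and an in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io
import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))


def read_unaccented(lines):
    """Yield each CSV field with accents removed; `lines` is an iterable of text lines."""
    for row in csv.reader(lines, skipinitialspace=True):
        for element in row:
            yield remove_accents(element)


# In-memory stand-in for open('test.txt', encoding='utf-8', newline='').
sample = io.StringIO('Montréal, über, 12.89, Mère, Françoise, noël, 889\n')
fields = list(read_unaccented(sample))
print(fields)
```

Passing `encoding='utf-8'` to `open()` replaces the default-encoding hack, and `skipinitialspace=True` drops the space after each comma.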


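Actually, I work on a project that has to be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.

Thanks to you, I have created this function that works wonders.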

import re
import unicodedata


def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)


def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

>>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
'montreal_uber_1289_mere_francoise_noel_889'
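If Python 2 support is not needed, the same pipeline can be condensed. This is a Python 3 only sketch of the idea, not a drop-in replacement for the function above. It also shows an edge case worth knowing: because everything non-ASCII is discarded, input in a non-Latin script yields an empty ID.

```python
import re
import unicodedata


def text_to_id(text):
    """Python 3 only sketch of the same pipeline."""
    decomposed = unicodedata.normalize('NFD', text.lower())
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    ascii_only = re.sub(r' +', '_', ascii_only)          # runs of spaces become one underscore
    return re.sub(r'[^0-9a-zA-Z_-]', '', ascii_only)     # drop punctuation


latin_id = text_to_id('Montréal, über, 12.89, Mère, Françoise, noël, 889')
greek_id = text_to_id('Ελληνικά')  # every letter is non-ASCII, so nothing is left
print(repr(latin_id), repr(greek_id))
```

Callers that accept arbitrary user input may want to check for an empty result and fall back to some other ID scheme.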


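Some languages use combining diacritics as part of their letters, and use accent diacritics separately to mark stress.

I think it is safer to specify explicitly which diacritics you want to strip: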

import unicodedata


def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
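A short usage sketch of this approach follows. The constant `DEFAULT_ACCENTS` and the extra 'COMBINING DIAERESIS' entry are additions for illustration. Marks that are not listed survive, and the final NFC step recomposes them with their base letter.

```python
import unicodedata

DEFAULT_ACCENTS = ('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')


def strip_accents(string, accents=DEFAULT_ACCENTS):
    marks = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in marks]
    return unicodedata.normalize('NFC', ''.join(chars))


# With the default list, the diaeresis in 'über' is kept.
default_result = strip_accents('São Paulo, café, über')

# Opting in to one more mark removes it as well.
with_diaeresis = strip_accents('São Paulo, café, über',
                               accents=DEFAULT_ACCENTS + ('COMBINING DIAERESIS',))
print(default_result, '|', with_diaeresis)
```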


gensim.utils.deaccent(text) from Gensim - topic modelling for humans:

'Sef chomutovskych komunistu dostal postou bily prasek'

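Another solution is unidecode.

Note that the suggested solutions based on unicodedata typically remove accents only from some characters (e.g. they turn 'ł' into '', rather than into 'l').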
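The reason is that 'ł' has no decomposition, so there is no separate mark to remove. What happens next depends on the filtering step, as this standard-library sketch shows (the variable names are illustrative only):

```python
import unicodedata

# 'ł' (LATIN SMALL LETTER L WITH STROKE) has no canonical or compatibility
# decomposition, so normalization returns it unchanged.
decomposed = unicodedata.normalize('NFKD', 'ł')

# Filtering out combining marks leaves the letter as it is ...
kept = ''.join(c for c in decomposed if not unicodedata.combining(c))

# ... while the ASCII-ignore variant drops it entirely.
dropped = decomposed.encode('ascii', 'ignore').decode('ascii')
print(repr(kept), repr(dropped))
```

Neither variant produces 'l'; that requires a transliteration table or a library such as unidecode.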


import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)


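If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]…

A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.

Here is an example from the page mentioned above: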

from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'

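EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.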


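In my view, the proposed solutions should NOT be accepted answers. The original question asks for the removal of accents, so the correct answer should do only that, not that plus other, unspecified changes.

Simply observe the result of this code, which is the accepted answer. I changed "Málaga" to "Málagueña":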

accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'

There is an additional change (ñ -> n), which was not requested in the OQ.

A simple function that does the requested task, in lower-case form:

import re


def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
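The same table can also be applied in a single pass with `str.translate`. The following is an alternative sketch, not the answer's code; the names `_GROUPS`, `_TABLE` and `f_remove_accents_translate` are invented here. It normalizes to NFC first so that decomposed input (a base letter followed by a combining mark) is matched too.

```python
import unicodedata

# Same accented letters as the regex version, grouped by base letter.
_GROUPS = {'a': 'àáâãäå', 'e': 'èéêë', 'i': 'ìíîï', 'o': 'òóôõö', 'u': 'ùúûü'}
_TABLE = str.maketrans({accented: base
                        for base, group in _GROUPS.items()
                        for accented in group})


def f_remove_accents_translate(old):
    """Lower-case, compose to NFC, then map the listed accented vowels to plain ones."""
    return unicodedata.normalize('NFC', old.lower()).translate(_TABLE)


result = f_remove_accents_translate('Málagueña')
print(result)  # the ñ is deliberately left alone
```

As with the regex version, only the listed vowels are touched, so letters such as ñ or ç pass through unchanged.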


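Here is a short function which strips the diacritics, but keeps the non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.

Code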

from unicodedata import combining, normalize


LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"


def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

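NB. The default argument outliers is evaluated once and is not meant to be provided by the caller.

Intended usage

As a key to sort a list of strings in a more "natural" order: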

sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)

Output:

['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']

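If your strings mix text and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.

Tests

To make sure the behavior meets your needs, take a look at the pangrams below: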

examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好，世界", "你好，世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]


for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant

# Every key must be a single character for str.maketrans, hence "ẞ" (capital sharp s) rather than "SS".
LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"


def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))


There are already many answers here, but this was not previously considered: using sklearn

from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode


accented_string = u'Málagueña®'


print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena

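This is particularly useful if you are already using sklearn to process text. These are the functions called internally by classes like CountVectorizer to normalize strings: with strip_accents='ascii', strip_accents_ascii is called, and with strip_accents='unicode', strip_accents_unicode is called.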
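For readers without scikit-learn installed, the difference between the two modes can be approximated with the standard library. The functions below are a sketch that reproduces the outputs shown above; they are not scikit-learn's actual source, and the `_sketch` names are made up.

```python
import unicodedata


def strip_accents_unicode_sketch(s):
    """NFKD-decompose, then drop combining marks; other non-ASCII characters stay."""
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in normalized if not unicodedata.combining(c))


def strip_accents_ascii_sketch(s):
    """NFKD-decompose, then drop everything that is not ASCII."""
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')


accented_string = 'Málagueña®'
print(strip_accents_unicode_sketch(accented_string))  # keeps the ® sign
print(strip_accents_ascii_sketch(accented_string))    # drops the ® sign
```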

More details

Finally, consider these details from the docstrings:

Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing


Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.

and

Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart


Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.