Ruby method to remove accents from UTF-8 international characters

I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (i.e. accented letters), and I want to create a copy with the accents removed.

I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāă]/u,     'a')
  n.gsub!(/æ/u,              'ae')
  n.gsub!(/[ďđ]/u,           'd')
  n.gsub!(/[çćčĉċ]/u,        'c')
  n.gsub!(/[èéêëēęěĕė]/u,    'e')
  n.gsub!(/ƒ/u,              'f')
  n.gsub!(/[ĝğġģ]/u,         'g')
  n.gsub!(/[ĥħ]/,            'h')
  n.gsub!(/[ììíîïīĩĭ]/u,     'i')
  n.gsub!(/[įıĳĵ]/u,         'j')
  n.gsub!(/[ķĸ]/u,           'k')
  n.gsub!(/[łľĺļŀ]/u,        'l')
  n.gsub!(/[ñńňņŉŋ]/u,       'n')
  n.gsub!(/[òóôõöøōőŏŏ]/u,   'o')
  n.gsub!(/œ/u,              'oe')
  n.gsub!(/ą/u,              'q')
  n.gsub!(/[ŕřŗ]/u,          'r')
  n.gsub!(/[śšşŝș]/u,        's')
  n.gsub!(/[ťţŧț]/u,         't')
  n.gsub!(/[ùúûüūůűŭũų]/u,   'u')
  n.gsub!(/ŵ/u,              'w')
  n.gsub!(/[ýÿŷ]/u,          'y')
  n.gsub!(/[žżź]/u,          'z')
  n.gsub!(/\s+/,             ' ')
  n.gsub!(/[^\sa-z0-9_-]/,   '')
  n
end

Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

I am not using Rails, nor do I plan on doing so.


So far the following is the only way I've been able to accomplish what I need:

str.tr(
  "ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
  "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.
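Part of what makes the tr approach fragile is that the two strings have to stay aligned character for character, and tr cannot map one character to two (æ to "ae", for example). A minimal sketch of a table-driven alternative follows. The names ACCENT_GROUPS, ACCENT_MAP, ACCENT_RE and fold_accents are made up for illustration, and the table is deliberately partial (lowercase only, precomposed input assumed):

```ruby
# Hypothetical, partial table: each ASCII replacement is listed once,
# next to the accented letters it stands in for.
ACCENT_GROUPS = {
  'a'  => 'àáâãäå',
  'ae' => 'æ',
  'c'  => 'ç',
  'e'  => 'èéêë',
  'i'  => 'ìíîï',
  'n'  => 'ñ',
  'o'  => 'òóôõöø',
  'oe' => 'œ',
  'u'  => 'ùúûü',
  'y'  => 'ýÿ'
}.freeze

# Flatten the groups into a per-character lookup, e.g. "é" => "e".
ACCENT_MAP = ACCENT_GROUPS.each_with_object({}) do |(ascii, accented), map|
  accented.each_char { |ch| map[ch] = ascii }
end.freeze

# One regexp that matches any character present in the table.
ACCENT_RE = Regexp.union(ACCENT_MAP.keys)

def fold_accents(str)
  # String#gsub with a Hash replaces each match with the value stored under it.
  str.gsub(ACCENT_RE, ACCENT_MAP)
end

puts fold_accents("Françoise Isaïe".downcase)  # francoise isaie
puts fold_accents("cœur")                      # coeur
```

This is still a hand-maintained list, so it only covers what you put in it.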

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"

The parameterize method (from ActiveSupport) is a nice and simple way to remove special characters so that the string can be used as a human-readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"

If you are using Rails:

"L'Oréal".parameterize(separator: ' ')

Solution:

DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def removeaccents(str)
  str
    .unicode_normalize(:nfd)
    .tr(DIACRITICS, '')
    .unicode_normalize(:nfc)
end

Example (before/after):

ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa

Explanations:

  • Decompose each precomposed (single-codepoint) character into its constituent codepoints, i.e. a base character followed by its combining marks (where applicable).
  • Remove the diacritical mark codepoints (Unicode 15.0.0 reference) found in the following blocks:
    • Combining Diacritical Marks Supplement (U+1DC0 → U+1DFF)
    • Combining Diacritical Marks (U+0300 → U+036F)
    • Combining Half Marks (U+FE20 → U+FE2F)
  • Recompose the characters.
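To see what each step does, here is a small walk-through on single characters. It is only a sketch: the variable names are mine, and the Hangul syllable is an extra example I picked to show why the final recomposition is worth keeping.

```ruby
# The three blocks listed above, packed into one string of combining marks.
marks = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

precomposed = "\u1EC7" # "ệ" stored as a single codepoint

# Step 1: decompose into the base letter plus its two combining marks.
step1 = precomposed.unicode_normalize(:nfd)
p step1.codepoints.map { |cp| format('U+%04X', cp) }
# ["U+0065", "U+0323", "U+0302"]

# Step 2: drop every codepoint that belongs to one of the three blocks.
step2 = step1.tr(marks, '')
p step2 # "e"

# Step 3: recompose. Nothing changes for a bare "e" ...
step3 = step2.unicode_normalize(:nfc)
p step3 # "e"

# ... but it matters for text that NFD splits into pieces that are not
# diacritics, such as a Hangul syllable (split into three jamo).
hangul = "\uD55C"
split = hangul.unicode_normalize(:nfd)
p split.length                                      # 3
p split.tr(marks, '').unicode_normalize(:nfc) == hangul # true
```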

Caveats:

  • While these diacritics are predominantly used for text, some of them can also be used with symbols. These symbols will see these diacritics removed when they shouldn't be.
  • Obscure codepoints such as subtending marks are not removed. Despite their naming, the Unicode reference treats them as format characters rather than combining marks. Combining marks that live outside the three blocks above are not removed either: an example is the Arabic hamza above ◌ٔ (U+0654), which probably doesn't even get properly displayed in your browser.
  • Not a caveat per se but worth noting: diacritics that are preceded by a space or a breaking space are also removed. Some text-rendering software displays them as standalone characters, so removing them may be undesired.
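The caveats are easy to reproduce. The snippet below repeats the definitions from the solution so that it runs on its own. strip_all_marks is a hypothetical variant I am adding for comparison, not part of the solution: it removes every codepoint in the Unicode "nonspacing mark" category (\p{Mn}), which is broader and may strip marks that some scripts need to stay readable.

```ruby
DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def removeaccents(str)
  str.unicode_normalize(:nfd).tr(DIACRITICS, '').unicode_normalize(:nfc)
end

# Hypothetical broader variant (assumption, not from the solution):
# strips every nonspacing mark, whatever block it lives in.
def strip_all_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').unicode_normalize(:nfc)
end

# Symbols: U+2260 NOT EQUAL TO decomposes to "=" plus U+0338, so the
# stroke is removed and the meaning flips.
p removeaccents("\u2260")                # "="

# U+0654 (Arabic hamza above) is outside the three blocks, so it is kept ...
p removeaccents("x\u0654") == "x\u0654"  # true
# ... while the category-based variant removes it.
p strip_all_marks("x\u0654")             # "x"

# A standalone diacritic after a space is removed as well.
p removeaccents("a \u0301")              # "a "
```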