How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

At work it seems that no week passes without some encoding-related conniption, calamity, or catastrophe. The problem usually comes from programmers who think they can reliably process a "text" file without specifying the encoding. But you can't.

It has therefore been decided to forbid files from ever having names ending in *.txt or *.text. The thinking is that those extensions mislead casual programmers into a dull complacency about encodings, which leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don't know what you've got.

However, we aren't going to go that far. Instead, you will be expected to use a filename that ends in the encoding. For text files, for example, that would be something like README.ascii, README.latin1, README.utf8, and so on.

For files that demand a particular extension, if the encoding can be specified inside the file itself, as in Perl or Python, then you should do that. For files like Java source, where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit our new standard, but we can't possibly eyeball them all, so we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell whether something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we're stumped about the 8-bit encodings. Because we run in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files, and these especially are a problem.

For some time now I've been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven't found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I've looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I've also looked at modules that do the same sort of thing in both Perl and Python, but again and again it's the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in, and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I've cited, especially MacRoman. The files are more than 99% English-language text; there are a few in other languages, but not many.

If it's library code, our language preference is Perl, C, Java, or Python, and in that order. If it's just a program, then we don't really care what language it's in, as long as it comes in full source, runs on Unix, and is wholly unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I'm also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was it successful, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially true of small files, where you don't have enough data to go on. Fortunately, our files are seldom small. Apart from the odd README file, most are in the 50k to 250k size range, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral's Open Access repository. A rather huge file is BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines into it that are in some 8-bit encoding (Microsoft CP1252, I believe). It takes quite a while before you trip over that one. :(


My attempt at such a heuristic (assuming that you've ruled out ASCII and UTF-8):

  • If 0x7f to 0x9f don't appear at all, it's probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it's probably Windows-1252, because those are the "smart quotes", by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it's MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that's where the typographic quotes are in MacRoman).
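
A minimal Python sketch of this heuristic, assuming ASCII and UTF-8 have already been ruled out; the function name and the rule that a simple smart-quote majority decides the tie are my own assumptions beyond the bullets above:

def guess_8bit_encoding(data):
    counts = [0] * 256                      # byte-value histogram
    for b in bytearray(data):
        counts[b] += 1
    high_controls = sum(counts[0x7F:0xA0])  # 0x7F-0x9F: rarely used control codes
    smart_quotes = sum(counts[0x91:0x95])   # 0x91-0x94: cp1252 smart quotes
    mac_quotes = sum(counts[0xD2:0xD6])     # 0xD2-0xD5: MacRoman typographic quotes
    if high_controls == 0:
        return 'ISO-8859-1'
    if smart_quotes > mac_quotes:
        return 'windows-1252'
    return 'MacRoman'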

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for is some sort of algorithm that analyzes the bytes and, based on the probability of a certain byte being used in a certain language with a certain encoding, guesses what encoding the file uses. But that has to know which language the file uses, and becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you're unlikely to notice any difference whichever encoding you decide to use for that file, as the differences between all the mentioned encodings are all localized in the parts of the encodings that specify characters not normally used in the English language. You might have some trouble where the text uses special formatting, or special versions of punctuation (CP1252 has several versions of the quote characters, for instance), but for the gist of the text there will probably be no problems.

If you can detect every encoding EXCEPT for MacRoman, then it would be logical to assume that the ones that can't be deciphered are in MacRoman. In other words, just make a list of files that couldn't be processed and handle those as if they were MacRoman.

Another way to sort these files would be to make a server-based program that allows users to decide which encoding isn't garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you'll have thousands of files done in no time.

Finally, wouldn't it be better to just convert all existing files to a single format, and require that new files be in that format?

"Perl, C, Java, or Python, and in that order": interesting attitude :-)

"we stand a good change of knowing if something is probably UTF-8": Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False


# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD')
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is


# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it
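
For that third check, here is a sketch of the kind of structure-only regex meant; the pattern is my own and is deliberately looser than a strict validator (it accepts overlong forms, encoded surrogates, and the obsolete 5- and 6-byte forms):

import re

UTF8_STRUCTURE = re.compile(
    br'\A(?:[\x00-\x7F]'             # ASCII
    br'|[\xC0-\xDF][\x80-\xBF]'      # 2-byte sequences (including overlong \xC0/\xC1)
    br'|[\xE0-\xEF][\x80-\xBF]{2}'   # 3-byte sequences (including encoded surrogates)
    br'|[\xF0-\xF7][\x80-\xBF]{3}'   # 4-byte sequences
    br'|[\xF8-\xFB][\x80-\xBF]{4}'   # obsolete 5-byte sequences
    br'|[\xFC-\xFD][\x80-\xBF]{5}'   # obsolete 6-byte sequences
    br')*\Z')

def utf8_structured(text):
    # True if the bytes follow UTF-8's lead/continuation structure,
    # even where a strict decoder would reject them.
    return UTF8_STRUCTURE.match(text) is not None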

Once you've decided that it's neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I'm aware of don't support MacRoman and in any case don't do a good job on 8-bit charsets especially with English because AFAICT they depend on checking whether the decoding makes sense in the given language, ignoring the punctuation characters, and based on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I'd suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them -- that would distort the probabilities something shocking.

You've mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain "control characters" in the range 0x80 to 0x9F inclusive ... this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains "control characters" in the range U+0080 to U+009F inclusive ... this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from "ISO-8859-1" to UTF-8.
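
A quick way to flag case (2) after decoding; a sketch, and the function name is mine:

def has_c1_controls(utext):
    # C1 controls (U+0080-U+009F) in decoded text usually mean cp1252/cp850
    # bytes were transcoded as if they had been ISO-8859-1.
    return any(u'\x80' <= ch <= u'\x9f' for ch in utext)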

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that's file-oriented (instead of web-oriented) and works well with 8-bit character sets including legacy ones like cp850 and cp437. It's nowhere near prime time yet. I'm interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as equally "unencumbered" as you expect anyone's code solution to be?

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it's ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8's strict validation rules, false positives are extremely rare.
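
As a sketch, both easy cases fit in a few lines of Python (the function name and the None fall-through are my own):

def classify_easy_cases(data):
    if all(b < 0x80 for b in bytearray(data)):
        return 'ASCII'                 # no bytes above 0x7F
    try:
        data.decode('utf-8')
        return 'UTF-8'                 # validated as UTF-8
    except UnicodeDecodeError:
        return None                    # fall through to the 8-bit heuristics below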

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I've seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don't even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn't matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
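
Put together, a sketch of the above in Python; the byte sets are copied from the lists in this answer, and treating a tie as windows-1252 is my own choice:

NOT_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
CP1252_HINTS = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}

def cp1252_or_macroman(data):
    # Assumes ASCII and UTF-8 have already been ruled out.
    byte_values = bytearray(data)
    if any(b in NOT_IN_CP1252 for b in byte_values):
        return 'MacRoman'              # uses bytes undefined in windows-1252
    cp1252_score = sum(b in CP1252_HINTS for b in byte_values)
    macroman_score = sum(b in MACROMAN_HINTS for b in byte_values)
    return 'MacRoman' if macroman_score > cp1252_score else 'windows-1252'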

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, my classification code computes a probability (likelihood) that a text file has a particular encoding for all the encodings it understands. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability after that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes' theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probabilities, and the correlations, by examining a corpus of files that you have manually classified.
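
A sketch of that log-odds accumulation in Python; byte_weights is a hypothetical per-encoding table of log likelihood ratios that you would learn from your manually classified corpus:

import math

def encoding_probability(data, prior, byte_weights):
    # byte_weights[b] ~ log(P(byte b | this encoding) / P(byte b | other encodings))
    info = math.log(prior / (1.0 - prior))   # prior expressed as log-odds
    for b in bytearray(data[:1024]):         # only the first 1024 bytes, as above
        info += byte_weights.get(b, 0.0)
    # convert the final log-odds back to a probability (logistic function)
    if info >= 0:
        return 1.0 / (1.0 + math.exp(-info))
    e = math.exp(info)
    return e / (1.0 + e)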

Guided by the accepted answer (https://stackoverflow.com/a/4200765/2731103), I was able to improve the Ruby gem "charlotte" so that it identifies the requested encodings mostly correctly.

We use it in production environments to detect CSV file encodings before import.

These are the relevant parts (Ruby):

UTF8HASBOM = /^\xEF\xBB\xBF/n      # [239, 187, 191]
UTF32LEBOM = /^\xFF\xFE\x00\x00/n  # [255, 254, 0, 0]
UTF32BEBOM = /^\x00\x00\xFE\xFF/n  # [0, 0, 254, 255]

UTF16LEBOM = /^\xFF\xFE/n          # [255, 254]
UTF16BEBOM = /^\xFE\xFF/n          # [254, 255]

NOTIN1BYTE = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n
NOTISO8859 = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F\x80-\x84\x86-\x9F]/n

# Information to identify MacRoman
# https://stackoverflow.com/questions/4198804/
NOTINCP1252 = /[\x81\x8D\x8F\x90\x9D]/n
CP1252CHARS = /[\x92\x95\x96\x97\xAE\xB0\xB7\xE8\xE9\xF6]/n
MCROMNCHARS = /[\x8E\x8F\x9A\xA1\xA5\xA8\xD0\xD1\xD5\xE1]/n

# `detect` holds the raw file contents as a String.
detect.force_encoding('BINARY') # Needed to prevent non-matching regex charset.
sample = detect[0..19]          # Keep sample string under 23 bytes.
detect.sub!(UTF8HASBOM, '') if sample[UTF8HASBOM] # Strip any UTF-8 BOM.

# See: http://www.daniellesucher.com/2013/07/23/ruby-case-versus-if/
if    sample.ascii_only? && detect.force_encoding('UTF-8').valid_encoding?
  # plain ASCII; already tagged as UTF-8 by the check above

elsif sample[UTF32LEBOM] && detect.force_encoding('UTF-32LE').valid_encoding?
elsif sample[UTF32BEBOM] && detect.force_encoding('UTF-32BE').valid_encoding?
elsif sample[UTF16LEBOM] && detect.force_encoding('UTF-16LE').valid_encoding?
elsif sample[UTF16BEBOM] && detect.force_encoding('UTF-16BE').valid_encoding?

elsif detect.force_encoding('UTF-8').valid_encoding?
  # valid UTF-8 without a BOM

elsif detect.force_encoding('BINARY')[NOTISO8859].nil?
  detect.force_encoding('ISO-8859-1')

elsif detect.force_encoding('BINARY')[NOTIN1BYTE].nil?
  # some other single-byte encoding: decide between Windows-1252 and MacRoman
  if detect.force_encoding('BINARY')[NOTINCP1252].nil? &&
     detect.force_encoding('BINARY').scan(MCROMNCHARS).length < detect.force_encoding('BINARY').scan(CP1252CHARS).length
    detect.force_encoding('Windows-1252')
  else
    detect.force_encoding('MacRoman')
  end

else
  detect.force_encoding('BINARY')
end