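Is there a "Lorem ipsum" set of files for testing character encoding issues?

For layout work we have the famous "Lorem ipsum" text to test how things look.

What I am looking for is a set of files containing text in several different encodings that I can use in my JUnit tests, to test methods that deal with character encoding when reading text files.

Example: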

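Take an ISO 8859-1 encoded test file and a Windows-1252 encoded test file. The Windows-1252 file has to trigger the differences in the range 0x80 to 0x9F. In other words, it must contain at least one character from this range to distinguish it from ISO 8859-1.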
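A minimal Java sketch of that difference, assuming the runtime provides the `windows-1252` charset (it is not one of the charsets every JVM is required to support). The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class C1RangeDemo {
    // Decodes a single byte with the given charset and returns the resulting code point.
    static int decodeByte(int b, Charset charset) {
        String s = new String(new byte[] {(byte) b}, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x80 is a C1 control character in ISO 8859-1 but the euro sign in Windows-1252.
        System.out.printf("ISO 8859-1:   U+%04X%n", decodeByte(0x80, latin1));
        System.out.printf("Windows-1252: U+%04X%n", decodeByte(0x80, cp1252));
        // Outside 0x80 to 0x9F the two encodings agree, e.g. 0xE9 (e with acute accent).
        System.out.printf("0xE9 in both: U+%04X / U+%04X%n",
                decodeByte(0xE9, latin1), decodeByte(0xE9, cp1252));
    }
}
```

A test file made only of bytes such as 0xE9 cannot tell the two encodings apart, which is why the Windows-1252 file needs at least one byte from 0x80 to 0x9F.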

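Maybe the best set of test files would be one where the test file for each encoding contains every one of its characters exactly once. But maybe I am missing something. We all love this encoding stuff, right? :-)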
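For single-byte encodings, such a file could in principle be generated rather than downloaded. The following is only a sketch under that assumption (it does not apply to multi-byte encodings such as UTF-8), and `AllBytesSample` is a hypothetical name.

```java
import java.nio.charset.Charset;

class AllBytesSample {
    // All 256 byte values, each exactly once, in ascending order.
    static byte[] allBytes() {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            bytes[i] = (byte) i;
        }
        return bytes;
    }

    // Counts the bytes that the charset cannot decode; new String(...) replaces
    // each of them with U+FFFD, the Unicode replacement character.
    static long undefinedCount(Charset charset) {
        return new String(allBytes(), charset)
                .chars()
                .filter(c -> c == 0xFFFD)
                .count();
    }
}
```

Writing `allBytes()` to disk gives one file that can be decoded with any single-byte charset under test; `undefinedCount` shows how many byte values that charset leaves unassigned, which is one way to see where two encodings differ.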

Is there such a set of test files for character encoding issues out there?


The Wikipedia article on diacritics is pretty comprehensive; unfortunately, you have to extract the characters manually. There may also be mnemonics for each language. For instance, in Polish we use:

Zażółć gęślą jaźń

which contains all 9 Polish diacritics in one correct sentence. Another useful search term is pangrams: sentences that use every letter of the alphabet at least once:

  • in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).

  • in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).

List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:

public interface NationalCharacters {
    String spanish();
    String russian();
    //...
}

library?
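One possible shape for such a wrapper, as a sketch only. It is trimmed to the Polish and Spanish sentences quoted above, `PangramCharacters` and `fitsIn` are hypothetical names, and the string literals use Unicode escapes so that the source compiles the same way regardless of the compiler's source encoding.

```java
import java.nio.charset.Charset;

// Trimmed to two languages; russian() and others would follow the same pattern.
interface NationalCharacters {
    String polish();
    String spanish();
}

class PangramCharacters implements NationalCharacters {
    // The Polish mnemonic with all nine diacritics.
    @Override
    public String polish() {
        return "Za\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 ja\u017A\u0144";
    }

    // The Spanish pangram quoted in the list above.
    @Override
    public String spanish() {
        return "El veloz murci\u00E9lago hind\u00FA com\u00EDa feliz cardillo y kiwi. "
                + "La cig\u00FCe\u00F1a tocaba el saxof\u00F3n detr\u00E1s del palenque de paja.";
    }

    // True if every character of the text can be represented in the named charset.
    static boolean fitsIn(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        NationalCharacters samples = new PangramCharacters();
        System.out.println(fitsIn(samples.polish(), "ISO-8859-2"));
        System.out.println(fitsIn(samples.polish(), "ISO-8859-1"));
        System.out.println(fitsIn(samples.spanish(), "ISO-8859-1"));
    }
}
```

A JUnit test could use `fitsIn` to pick, for each sample sentence, an encoding that must round-trip it and one that must fail.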

Well, I used an online tool to create my text character sets from Lorem Ipsum. I believe it can help you. I don't have one that has all the different charsets in a single page.

http://generator.lorem-ipsum.info/

I don't know of any complete text documents, but if a simple overview of each character set is enough to start with, there are mapping files available on the ftp.unicode.org server.

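Here's the mapping for a Windows code page, for example. The first column is the byte value in hexadecimal, and the second is the corresponding Unicode code point. Note that the link below points to CP1250.TXT (Windows-1250); the Windows-1252 table is CP1252.TXT in the same directory.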

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
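These mapping files are easy to load in a test. The sketch below assumes the layout described above (hex byte value, hex Unicode value, then a `#` comment, separated by whitespace) and that unassigned bytes simply have an empty second column; `MappingTableParser` is a hypothetical name, and the sample lines in `main` are illustrative rather than copied from the file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MappingTableParser {
    // Parses lines such as "0x80<TAB>0x20AC<TAB>#EURO SIGN" into byte value -> code point.
    static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        for (String line : lines) {
            // Drop the trailing comment, then split the remaining columns on whitespace.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue; // blank line or pure comment line
            }
            String[] cols = data.split("\\s+");
            if (cols.length < 2) {
                continue; // byte listed without a mapping, treated as undefined
            }
            table.put(Integer.decode(cols[0]), Integer.decode(cols[1]));
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "# illustrative sample lines",
                "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
                "0x80\t0x20AC\t#EURO SIGN",
                "0x81\t      \t#UNDEFINED");
        System.out.println(parse(sample));
    }
}
```

With the table loaded, a test can encode every mapped code point with the charset under test and compare the resulting byte against the table.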

How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files

There are a few comprehensive, ready-to-use Unicode test files that you can download directly.

From w3c

Here, there's a nice test file by w3.org including: maths, linguistics, Greek, Georgian, Russian, Thai, Runes, Braille among many others in a single file:

Coming from w3.org should be nice to use, shouldn't it?

Cutting out the HTML part

If you want to get the "original txt file" without the risk of your editor messing it up: 1) download it, 2) trim it with tail and head, 3) check the result with diff:

wget https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
tail -n +8 UTF-8-demo.html | head -n -3 > UTF-8-demo.txt
diff UTF-8-demo.html UTF-8-demo.txt

This generates a UTF-8-demo.txt without human intervention and without the risk of losing data.
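Once the file is on disk, a test can check that it decodes cleanly. By default `new String(bytes, charset)` silently substitutes U+FFFD for bad input, so the sketch below configures a `CharsetDecoder` to report errors instead. `StrictDecoding` is a hypothetical helper name, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecoding {
    // Decodes the bytes and throws on malformed or unmappable input
    // instead of silently substituting the replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper for assertions: true if the bytes decode without errors.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

In a JUnit test this could be called as `StrictDecoding.isValid(Files.readAllBytes(Paths.get("UTF-8-demo.txt")), StandardCharsets.UTF_8)`, assuming the file produced by the commands above is in the working directory.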

More from w3c

There are many more files one level up in the directory structure, still inside the dir utf-8-test:

From github

There's also a very interesting file here with ALL printable characters (including Chinese, Braille, Arabic, etc.)

Also want non-printable characters?

There are also many more test files in the same repo:

and also a generator, in case you don't trust the committed file and want to generate it yourself.

My personal choice

For my projects I have decided to start with two files: the specific one I pointed out from w3c and the specific one I pointed out from the GitHub repo by bits.