import java.io.*;
public class Foo {
private static final String BOM = "\ufeff";
private static final String TEST_STRING
= "ASCII abcde xyz\n"
+ "German äöü ÄÖÜ ß\n"
+ "Polish ąęźżńł\n"
+ "Russian абвгдеж эюя\n"
+ "CJK 你好\n";
public static void main(String[] args)
throws Exception
{
String[] encodings = new String[] {
"UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };
for (String encoding: encodings) {
System.out.println("== " + encoding);
for (boolean writeBom: new Boolean[] {false, true}) {
System.out.println(writeBom ? "= bom" : "= no bom");
String output = (writeBom ? BOM : "") + TEST_STRING;
byte[] bytes = output.getBytes(encoding);
System.out.write(bytes);
FileOutputStream out = new FileOutputStream("uc-test-"
+ encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
out.write(bytes);
out.close();
}
}
}
}
默认代码页中的输出?总垃圾!
Z:\andrew\projects\sx\1259084>chcp
Active code page: 850
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
= bom
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
== UTF-16LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
= bom
■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
== UTF-16BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
== UTF-32LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
== UTF-32BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
但是,如果我们type被保存的文件呢?它们包含确切的
. .
Z:\andrew\projects\sx\1259084>type *.txt
uc-test-UTF-16BE-bom.txt
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16LE-bom.txt
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
uc-test-UTF-16LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
uc-test-UTF-32BE-bom.txt
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32LE-bom.txt
A S C I I a b c d e x y z
G e r m a n ä ö ü Ä Ö Ü ß
P o l i s h ą ę ź ż ń ł
R u s s i a n а б в г д е ж э ю я
C J K 你 好
uc-test-UTF-32LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
uc-test-UTF-8-bom.txt
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
uc-test-UTF-8-nobom.txt
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
只有工作的东西是UTF-16LE文件,带有BOM,打印到
.控制台通过type
如果我们使用type以外的东西来打印文件,就会得到垃圾:
Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
1 file(s) copied.
Z:\andrew\projects\sx\1259084>chcp 65001
Active code page: 65001
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
ж эюя
CJK 你好
你好
好
�
= bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
еж эюя
CJK 你好
你好
好
�
== UTF-16LE
= no bom
A S C I I a b c d e x y z
…
然而,设置Unicode UTF-8代码页的C程序:
#include <stdio.h>
#include <windows.h>
int main() {
int c, n;
UINT oldCodePage;
char buf[1024];
oldCodePage = GetConsoleOutputCP();
if (!SetConsoleOutputCP(65001)) {
printf("error\n");
}
freopen("uc-test-UTF-8-nobom.txt", "rb", stdin);
n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
fwrite(buf, sizeof(buf[0]), n, stdout);
SetConsoleOutputCP(oldCodePage);
return 0;
}