Reading the UTF-8 BOM marker

I'm reading a file through a FileReader. The file is UTF-8 encoded (with a BOM). My problem: I read the file and output a String, but unfortunately the BOM marker is output too. Why does this happen?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line

?<style>

In Java, you have to manually consume the UTF-8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM
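For completeness, here is a minimal sketch of consuming the BOM by hand with only the JDK, using a PushbackInputStream. The helper name skipUtf8Bom is illustrative, not a standard API:

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Peek at the first three bytes; swallow them if they are the UTF-8 BOM (EF BB BF),
// otherwise push them back so the caller sees the stream unchanged.
static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pin = new PushbackInputStream(in, 3);
    byte[] head = new byte[3];
    int n = pin.read(head, 0, 3);
    boolean hasBom = n == 3
            && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
    if (!hasBom && n > 0) {
        pin.unread(head, 0, n); // no BOM: give the bytes back
    }
    return pin;
}

Wrap the returned stream in an InputStreamReader with the UTF-8 charset and the BOM never reaches your Strings.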

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report
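In context, a minimal sketch of this approach, assuming a UTF-8 file read line by line (the method and variable names are illustrative):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Read the whole file, stripping any U+FEFF (the decoded BOM) from each line.
static String readWithoutBom(File file) throws IOException {
    StringBuilder content = new StringBuilder();
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            content.append(line.replace("\uFEFF", ""))
                   .append(System.lineSeparator());
        }
    }
    return content.toString();
}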

Use the Apache Commons IO library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}

Here's how I use the Apache BOMInputStream; it uses a try-with-resources block. The "false" argument tells the object to exclude the listed BOMs from the stream, i.e. they are detected and skipped rather than passed through (we use "BOM-less" text files for safety reasons, haha):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file), false,
                ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
} catch (Exception e) {
    // handle the exception
}

It's mentioned here that this is usually a problem with files on Windows.

One possible solution would be running the file through a tool like dos2unix first.
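If an external tool isn't available, here is a minimal Java sketch in the same spirit: it strips a leading UTF-8 BOM from the file itself before any further processing. The helper name stripUtf8Bom is illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Rewrite the file without its leading EF BB BF bytes, if present.
static void stripUtf8Bom(Path path) throws IOException {
    byte[] bytes = Files.readAllBytes(path);
    if (bytes.length >= 3
            && bytes[0] == (byte) 0xEF && bytes[1] == (byte) 0xBB && bytes[2] == (byte) 0xBF) {
        Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
    }
}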

Use Apache Commons IO.

For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));


BOMInputStream bomInputStream = new BOMInputStream(inputStream);


ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {


char theChar = (char) data;
data = reader.read();
ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" with all the characters from the file "1.txt", excluding the BOM.

The easiest way I found to bypass the BOM:

// fis is a FileInputStream for the file being read
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM character if present
    currentLine = currentLine.replace("\uFEFF", "");
}

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = StandardCharsets.UTF_8;  // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
    ....
}

Maven Dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

If somebody wants to do it with just the standard library, this would be one way:

// Requires java.math.BigInteger. Note: this relies on the default charset mapping the
// first three characters of value back to the original BOM bytes.
public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.startsWith("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}