Byte order mark screws up file reading in Java

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all of them. When present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?


Unfortunately not. You'll have to identify it and skip it yourself. This page details what you have to watch out for. See also this SO question for more details.

EDIT: I have published a proper version on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream


Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in Sun's bug database. Incorporate it into your code and you're done.

/* ____________________________________________________________________________
*
* File:    UnicodeBOMInputStream.java
* Author:  Gregory Pakosz.
* Date:    02 - November - 2005
* ____________________________________________________________________________
*/
package com.stackoverflow.answer;


import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;


/**
* The <code>UnicodeBOMInputStream</code> class wraps any
* <code>InputStream</code> and detects the presence of any Unicode BOM
* (Byte Order Mark) at its beginning, as defined by
* <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
*
* <p>The
* <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
* defines 5 types of BOMs:<ul>
* <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
* <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
* <li><pre>FE FF        = UTF-16, big-endian</pre></li>
* <li><pre>FF FE        = UTF-16, little-endian</pre></li>
* <li><pre>EF BB BF     = UTF-8</pre></li>
* </ul></p>
*
* <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
* or not.
* </p>
* <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
* wrapped <code>InputStream</code> object.</p>
*/
public class UnicodeBOMInputStream extends InputStream
{
/**
* Type safe enumeration class that describes the different types of Unicode
* BOMs.
*/
public static final class BOM
{
/**
* NONE.
*/
public static final BOM NONE = new BOM(new byte[]{},"NONE");


/**
* UTF-8 BOM (EF BB BF).
*/
public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
(byte)0xBB,
(byte)0xBF},
"UTF-8");


/**
* UTF-16, little-endian (FF FE).
*/
public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
(byte)0xFE},
"UTF-16 little-endian");


/**
* UTF-16, big-endian (FE FF).
*/
public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
(byte)0xFF},
"UTF-16 big-endian");


/**
* UTF-32, little-endian (FF FE 00 00).
*/
public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
(byte)0xFE,
(byte)0x00,
(byte)0x00},
"UTF-32 little-endian");


/**
* UTF-32, big-endian (00 00 FE FF).
*/
public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
(byte)0x00,
(byte)0xFE,
(byte)0xFF},
"UTF-32 big-endian");


/**
* Returns a <code>String</code> representation of this <code>BOM</code>
* value.
*/
public final String toString()
{
return description;
}


/**
* Returns the bytes corresponding to this <code>BOM</code> value.
*/
public final byte[] getBytes()
{
final int     length = bytes.length;
final byte[]  result = new byte[length];


// Make a defensive copy
System.arraycopy(bytes,0,result,0,length);


return result;
}


private BOM(final byte bom[], final String description)
{
assert(bom != null)               : "invalid BOM: null is not allowed";
assert(description != null)       : "invalid description: null is not allowed";
assert(description.length() != 0) : "invalid description: empty string is not allowed";


this.bytes          = bom;
this.description  = description;
}


final byte    bytes[];
private final String  description;


} // BOM


/**
* Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
* specified <code>InputStream</code>.
*
* @param inputStream an <code>InputStream</code>.
*
* @throws NullPointerException when <code>inputStream</code> is
* <code>null</code>.
* @throws IOException on reading from the specified <code>InputStream</code>
* when trying to detect the Unicode BOM.
*/
public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
IOException


{
if (inputStream == null)
throw new NullPointerException("invalid input stream: null is not allowed");


in = new PushbackInputStream(inputStream,4);


final byte  bom[] = new byte[4];
final int   read  = in.read(bom);


switch(read)
{
case 4:
if ((bom[0] == (byte)0xFF) &&
(bom[1] == (byte)0xFE) &&
(bom[2] == (byte)0x00) &&
(bom[3] == (byte)0x00))
{
this.bom = BOM.UTF_32_LE;
break;
}
else
if ((bom[0] == (byte)0x00) &&
(bom[1] == (byte)0x00) &&
(bom[2] == (byte)0xFE) &&
(bom[3] == (byte)0xFF))
{
this.bom = BOM.UTF_32_BE;
break;
}


case 3:
if ((bom[0] == (byte)0xEF) &&
(bom[1] == (byte)0xBB) &&
(bom[2] == (byte)0xBF))
{
this.bom = BOM.UTF_8;
break;
}


case 2:
if ((bom[0] == (byte)0xFF) &&
(bom[1] == (byte)0xFE))
{
this.bom = BOM.UTF_16_LE;
break;
}
else
if ((bom[0] == (byte)0xFE) &&
(bom[1] == (byte)0xFF))
{
this.bom = BOM.UTF_16_BE;
break;
}


default:
this.bom = BOM.NONE;
break;
}


if (read > 0)
in.unread(bom,0,read);
}


/**
* Returns the <code>BOM</code> that was detected in the wrapped
* <code>InputStream</code> object.
*
* @return a <code>BOM</code> value.
*/
public final BOM getBOM()
{
// BOM type is immutable.
return bom;
}


/**
* Skips the <code>BOM</code> that was found in the wrapped
* <code>InputStream</code> object.
*
* @return this <code>UnicodeBOMInputStream</code>.
*
* @throws IOException when trying to skip the BOM from the wrapped
* <code>InputStream</code> object.
*/
public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
{
if (!skipped)
{
in.skip(bom.bytes.length);
skipped = true;
}
return this;
}


/**
* {@inheritDoc}
*/
public int read() throws IOException
{
return in.read();
}


/**
* {@inheritDoc}
*/
public int read(final byte b[]) throws  IOException,
NullPointerException
{
return in.read(b,0,b.length);
}


/**
* {@inheritDoc}
*/
public int read(final byte b[],
final int off,
final int len) throws IOException,
NullPointerException
{
return in.read(b,off,len);
}


/**
* {@inheritDoc}
*/
public long skip(final long n) throws IOException
{
return in.skip(n);
}


/**
* {@inheritDoc}
*/
public int available() throws IOException
{
return in.available();
}


/**
* {@inheritDoc}
*/
public void close() throws IOException
{
in.close();
}


/**
* {@inheritDoc}
*/
public synchronized void mark(final int readlimit)
{
in.mark(readlimit);
}


/**
* {@inheritDoc}
*/
public synchronized void reset() throws IOException
{
in.reset();
}


/**
* {@inheritDoc}
*/
public boolean markSupported()
{
return in.markSupported();
}


private final PushbackInputStream in;
private final BOM                 bom;
private       boolean             skipped = false;


} // UnicodeBOMInputStream

And here is how you use it:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;


public final class UnicodeBOMInputStreamUsage
{
public static void main(final String[] args) throws Exception
{
FileInputStream fis = new FileInputStream("test/offending_bom.txt");
UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);


System.out.println("detected BOM: " + ubis.getBOM());


System.out.print("Reading the content of the file without skipping the BOM: ");
InputStreamReader isr = new InputStreamReader(ubis);
BufferedReader br = new BufferedReader(isr);


System.out.println(br.readLine());


br.close();
isr.close();
ubis.close();
fis.close();


fis = new FileInputStream("test/offending_bom.txt");
ubis = new UnicodeBOMInputStream(fis);
isr = new InputStreamReader(ubis);
br = new BufferedReader(isr);


ubis.skipBOM();


System.out.print("Reading the content of the file after skipping the BOM: ");
System.out.println(br.readLine());


br.close();
isr.close();
ubis.close();
fis.close();
}


} // UnicodeBOMInputStreamUsage

The Google Data API has a UnicodeReader which automagically detects the encoding.

You can use it instead of InputStreamReader. Here is a slightly compacted excerpt of its source, which is quite straightforward:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
private static final int BOM_SIZE = 4;
private final InputStreamReader reader;


/**
* Construct UnicodeReader
* @param in Input stream.
* @param defaultEncoding Default encoding to be used if BOM is not found,
* or <code>null</code> to use system default encoding.
* @throws IOException If an I/O error occurs.
*/
public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
byte bom[] = new byte[BOM_SIZE];
String encoding;
int unread;
PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
int n = pushbackStream.read(bom, 0, bom.length);


// Read ahead four bytes and check for BOM marks.
if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
encoding = "UTF-8";
unread = n - 3;
} else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
encoding = "UTF-16BE";
unread = n - 2;
} else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
encoding = "UTF-16LE";
unread = n - 2;
} else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
encoding = "UTF-32BE";
unread = n - 4;
} else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
encoding = "UTF-32LE";
unread = n - 4;
} else {
encoding = defaultEncoding;
unread = n;
}


// Unread bytes if necessary and skip BOM marks.
if (unread > 0) {
pushbackStream.unread(bom, (n - unread), unread);
} else if (unread < -1) {
pushbackStream.unread(bom, 0, 0);
}


// Use given encoding.
if (encoding == null) {
reader = new InputStreamReader(pushbackStream);
} else {
reader = new InputStreamReader(pushbackStream, encoding);
}
}


public String getEncoding() {
return reader.getEncoding();
}


public int read(char[] cbuf, int off, int len) throws IOException {
return reader.read(cbuf, off, len);
}


public void close() throws IOException {
reader.close();
}
}
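
A brief usage sketch for the class above; the file name and the UTF-8 fallback are merely example choices:

import java.io.BufferedReader;
import java.io.FileInputStream;

public class UnicodeReaderUsage {
    public static void main(String[] args) throws Exception {
        // "data.csv" is just an example path; "UTF-8" is the fallback used when no BOM is found
        UnicodeReader unicodeReader = new UnicodeReader(new FileInputStream("data.csv"), "UTF-8");
        System.out.println("Detected encoding: " + unicodeReader.getEncoding());

        try (BufferedReader reader = new BufferedReader(unicodeReader)) {
            System.out.println(reader.readLine()); // the BOM, if any, was already consumed
        }
    }
}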

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
// has a UTF-8 BOM
}

If you also need to detect different encodings, it can distinguish among the various byte-order marks, e.g. UTF-8 vs. UTF-16 big and little endian; see the doc link above for details. You can then use the detected ByteOrderMark to choose a Charset for decoding the stream, as shown in the sketch further below. (There is probably a more streamlined way to do this if you need all of that functionality; maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, that obviously helps.

EDIT: If you also need to detect the BOM of UTF-16, UTF-32, etc., the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Update per @martin-charlesworth's comment :)
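
A minimal sketch of the charset-selection idea mentioned above, assuming commons-io is on the classpath; the UTF-8 fallback for BOM-less files is an assumption of this example, not something BOMInputStream decides for you:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

public class BOMCharsetSelection {
    public static void main(String[] args) throws Exception {
        try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream("data.csv"),
                ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)) {
            // getBOMCharsetName() returns e.g. "UTF-16BE"; fall back to UTF-8 when there is no BOM
            String charsetName = bomIn.hasBOM() ? bomIn.getBOMCharsetName() : "UTF-8";
            BufferedReader reader = new BufferedReader(new InputStreamReader(bomIn, charsetName));
            System.out.println(reader.readLine()); // the BOM itself has been discarded
        }
    }
}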

A simpler solution:

import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    public static void skip(Reader reader) throws IOException
    {
        // Peek at the first character: a decoded BOM always shows up as U+FEFF
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}

Usage example:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

It works with all five UTF encodings (as long as the reader is opened with the file's actual charset, as in the example above)!

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mentioned how to get an InputStream without the BOM.

Here is how I did it in Scala.

import java.io._
import org.apache.commons.io.input.BOMInputStream

val file = new File(path_to_xml_file_with_BOM)
val fileInpStream = new FileInputStream(file)
val bomIn = new BOMInputStream(fileInpStream, false) // false means don't include BOM

To simply remove the BOM characters from your file, I recommend using Apache Commons IO's BOMInputStream:

public BOMInputStream(InputStream delegate,
boolean include)
Constructs a new BOM InputStream that detects a a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Set include to false and the BOM characters will be excluded.
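
A small sketch of that constructor in use; the file name and the UTF-8 charset for the reader are example assumptions:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;

public class ExcludeBOMExample {
    public static void main(String[] args) throws Exception {
        // include = false: a UTF-8 BOM, if present, is detected but not passed downstream
        try (BOMInputStream in = new BOMInputStream(new FileInputStream("file.csv"), false);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine()); // first line, without the BOM
        }
    }
}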

I had the same problem, and because I wasn't reading in a bunch of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page (Get unicode value of a character), I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

String str = reader.readLine().trim();
str = str.replace("\ufeff", "");

That solved my problem and allowed me to go on processing the file. I added the trim() just in case of leading or trailing whitespace; you can decide whether to do that based on your specific needs.

Notepad++ is a good tool for converting UTF-8 encoding to UTF-8 (BOM) encoding.

https://notepad-plus-plus.org/downloads/

Java

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

public class UTF8BOMTester {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        File file = new File("test.txt");
        boolean same = UTF8BOMInputStream.isSameEncodingType(file);
        System.out.println(same);
        if (same) {
            UTF8BOMInputStream is = new UTF8BOMInputStream(file);
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            System.out.println(br.readLine());
        }
    }

    static void bytesPrint(byte[] b) {
        for (byte a : b)
            System.out.printf("%x ", a);
    }
}

UTF8BOMInputStream.java

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class UTF8BOMInputStream extends InputStream {

    byte[] SYMBLE_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
    FileInputStream fis;
    final boolean isSameEncodingType;

    public UTF8BOMInputStream(File file) throws IOException {
        // Read the first three bytes and compare them against the UTF-8 BOM
        FileInputStream fis = new FileInputStream(file);
        byte[] symble = new byte[3];
        fis.read(symble);
        bytesPrint(symble);
        isSameEncodingType = isSameEncodingType(symble);
        if (isSameEncodingType)
            this.fis = fis;
        else
            this.fis = null;
    }

    @Override
    public int read() throws IOException {
        return fis.read();
    }

    void bytesPrint(byte[] b) {
        for (byte a : b)
            System.out.printf("%x ", a);
    }

    boolean bytesCompare(byte[] a, byte[] b) {
        if (a.length != b.length)
            return false;

        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }

    boolean isSameEncodingType(byte[] symble) {
        return bytesCompare(symble, SYMBLE_BOM);
    }

    public static boolean isSameEncodingType(File file) throws IOException {
        return (new UTF8BOMInputStream(file)).isSameEncodingType;
    }
}

Here is my code to read CSV files in most charsets. It should cover 99% of situations.

try (InputStream inputStream = new FileInputStream(csvFile)) {
    BOMInputStream bomInputStream = new BOMInputStream(inputStream,
            ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE,
            ByteOrderMark.UTF_32LE, ByteOrderMark.UTF_32BE);
    Charset charset;
    if (!bomInputStream.hasBOM()) charset = StandardCharsets.UTF_8;
    else if (bomInputStream.hasBOM(ByteOrderMark.UTF_8)) charset = StandardCharsets.UTF_8;
    else if (bomInputStream.hasBOM(ByteOrderMark.UTF_16LE)) charset = StandardCharsets.UTF_16LE;
    else if (bomInputStream.hasBOM(ByteOrderMark.UTF_16BE)) charset = StandardCharsets.UTF_16BE;
    else { throw new Exception("The charset of the file " + csvFile + " is not supported."); }

    try (Reader streamReader = new InputStreamReader(bomInputStream, charset);
         BufferedReader bufferedReader = new BufferedReader(streamReader)) {
        for (String line; (line = bufferedReader.readLine()) != null; ) {
            String[] columns = line.split(",");
            // read csv columns
        }
    }
}

IMO none of the given answers is really satisfying. Just skipping the BOM and then reading the rest of the stream in the current platform's default encoding is definitely wrong. Remember: the platform defaults on Unix/Linux and Windows differ; the former is UTF-8, the latter ANSI. Such a solution only works if the rest of the stream (after the BOM) contains only 7-bit ASCII characters (which, I admit, is true for most programmer-near files such as configuration files). But as soon as there are non-ASCII characters, this approach will fail.

That is why every Java class or method that converts byte arrays/streams to strings (and vice versa) takes a second parameter indicating the encoding to use (Reader, Writer, Scanner, String.getBytes(), etc.).
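
A quick illustration of that rule; the file name and the UTF-8 choice here are only examples, use whatever encoding your data is actually in:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetExample {
    public static void main(String[] args) throws Exception {
        // String <-> byte[] conversions: always name the encoding explicitly
        byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded);

        // Readers likewise take the charset as a second argument
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("data.csv"), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}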

There are far more character encodings in the world than just the UTF-xx family. And still, here in 2021, there are plenty of encoding problems between end-user applications, especially when they run on different platforms (iOS, Windows, Unix). All these problems exist only because programmers are too lazy to learn how character encodings work.

It is therefore absolutely essential to first evaluate which encoding to use, and then to perform the string/stream conversion with the encoding that was found. Consulting the corresponding specification is the first step. Only if you cannot determine which encoding you will encounter when reading a stream must you evaluate it yourself. But beware: such an evaluation will only ever be a "best guess", and there is no algorithm that can cover all possibilities.

In that sense, Lee's answer (and coding example) from Feb 6, 2021 is IMO the best one, except that he falls back to UTF-8 when there is no BOM.