An effective way to detect any file's encoding

Yes, this is one of the most frequently asked questions, and the subject is murky to me because I don't know much about it.

But I would like a really precise way to find a file's encoding. Notepad++ manages to be that precise.


I would try the following steps:

1) Check if there is a byte order mark (BOM)

2) Check if the file is valid UTF-8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII byte sequences in codepages other than UTF-8 are not valid UTF-8.
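A rough sketch of step 2 (not the poster's code): in .NET you can test UTF-8 validity by decoding with exception fallbacks, so any invalid byte sequence throws instead of being silently replaced.

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Returns true if the bytes form valid UTF-8.
    // A BOM, if present, should already have been handled in step 1.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // UTF-8 decoder configured to throw instead of emitting U+FFFD
        var strictUtf8 = Encoding.GetEncoding("utf-8",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strictUtf8.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If this returns false, fall back to step 3 and decode with the local "ANSI" codepage (e.g. Encoding.Default on .NET Framework).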

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If a file has no BOM, this cannot determine the file's encoding.

* Updated 4/08/2020 to include UTF-32LE detection and return the correct encoding for UTF-32BE

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}

The following code works fine for me, using the StreamReader class:

using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}

The trick is to use the Peek call; otherwise, .NET has not done anything yet (it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, it works as well.

If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in which case the encoding will be set to UTF8 by default, before any read), but I recommend you define what you consider the default encoding in your context.

I have tested this successfully with files that have a BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.

Look here for C#:

https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}

The following is my PowerShell code to determine whether some .cpp, .h, or .ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without a BOM; if neither, it assumes GB18030. I am a Chinese person working in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.

The approach is simple: if all the characters are within x00-x7E, then ASCII, UTF-8 and Latin-1 are all the same; but if I read a non-ASCII file with UTF-8, the replacement character � shows up, so I try reading it with Latin-1. In Latin-1, the range between \x7F and \xAF is empty, while GB uses the full range between \x00 and \xFF, so if I get any value in that range, the file is not Latin-1.

The code is written in PowerShell, but uses .NET, so it is easy to translate into C# or F#.

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach ($i in Get-ChildItem .\ -Recurse -include *.cpp, *.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c = $regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c = $regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        }
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
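Since the script only uses .NET types, the detection part of the heuristic can be sketched in C# along these lines (a sketch, not the author's code; `GuessEncoding` is a made-up name, and the \x7F-\xAF range follows the script above - checking the raw bytes is equivalent to checking the Latin-1-decoded string, because Latin-1 maps each byte 1:1 to U+0000-U+00FF):

```csharp
using System;
using System.Linq;
using System.Text;

static class LegacyEncodingGuesser
{
    // Same heuristic as the PowerShell script: if decoding as UTF-8
    // produces U+FFFD replacement characters, the file is not UTF-8;
    // then any byte in 0x7F-0xAF rules out Latin-1, so assume GB18030.
    public static Encoding GuessEncoding(byte[] bytes)
    {
        // The default UTF8 encoding uses replacement fallback,
        // so invalid sequences decode to U+FFFD rather than throwing.
        var asUtf8 = Encoding.UTF8.GetString(bytes);
        if (asUtf8.IndexOf('\uFFFD') < 0)
            return Encoding.UTF8;

        if (bytes.Any(b => b >= 0x7F && b <= 0xAF))
            // Note: on .NET Core, GB18030 requires registering
            // CodePagesEncodingProvider (System.Text.Encoding.CodePages).
            return Encoding.GetEncoding("GB18030");

        return Encoding.GetEncoding("ISO-8859-1");
    }
}
```

It shares the script's blind spots: GB text whose bytes all fall outside 0x7F-0xAF will be misread as Latin-1, and valid UTF-8 that genuinely contains U+FFFD will be misclassified.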

Have a look at this:

UDE

This is a port of the Mozilla Universal Charset Detector, and you can use it like this...

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}",
                cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

Providing the implementation details for the steps proposed by @CodesInChaos:

1) Check if there is a byte order mark (BOM)

2) Check if the file is valid UTF-8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII byte sequences in codepages other than UTF-8 are not valid UTF-8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.

using System;
using System.IO;
using System.Text;


// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines.
// If decoding fails, use the local "ANSI" codepage.

public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
        detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception e)
        {
            // Failed to decode the file using the BOM/UTF8.
            // Assume it's local ANSI
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
    }
}




[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);

    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
        // Consume your file
        var line = reader.ReadLine();
        ...

This might be useful:

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{
    Console.WriteLine(sr.CurrentEncoding);
}

.NET is not very helpful here, but you can try the following algorithm:

  1. Try to find the encoding by the BOM (byte order mark)... it is very likely not to be found
  2. Try parsing into different encodings

Here is the call:

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

And here is the code:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM (check UTF-32LE before UTF-16LE, because its BOM starts with the UTF-16LE one)
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE (Encoding.UTF32 is little-endian)

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception ex) { }

        return null;    // decoding failed with this encoding
    }
}

The following seems to work well.

First, create a helper method:

private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
{
    try
    {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
    }
    catch (Exception e)
    {
        return null;
    }
}

Then create the code to test the source. In this case, I've got a byte array whose encoding I need to detect:

public static Encoding DetectCodePage(byte[] contents)
{
    if (contents == null || contents.Length == 0)
    {
        return Encoding.Default;
    }

    return TestCodePage(Encoding.UTF8, contents)
        ?? TestCodePage(Encoding.Unicode, contents)
        ?? TestCodePage(Encoding.BigEndianUnicode, contents)
        ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
        ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
        ?? TestCodePage(Encoding.ASCII, contents)
        ?? TestCodePage(Encoding.Default, contents); // system default ("ANSI") code page
}

I have tried several different ways to detect the encoding and hit problems with most of them.

I made the following, leveraging a Microsoft Nuget package, and it seems to work for me so far, but needs a lot more testing.
Most of my testing has been on UTF-8, UTF-8 with BOM, and ANSI.

static void Main(string[] args)
{
    var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
    List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();

    int i = 0;
    foreach (var line in contents)
    {
        i++;
        Console.WriteLine(line);
        if (i > 100)
            break;
    }
}

public static Encoding GetEncoding(string filename)
{
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
        switch (detectedEncoding)
        {
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
                return Encoding.UTF8;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
                return Encoding.BigEndianUnicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
                return Encoding.Unicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
                return Encoding.UTF32;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
                return Encoding.ASCII;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
            default:
                return Encoding.Default;
        }
    }
}