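How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?

Is there any easy/general way to clean an XML-based data source prior to using it in an XmlReader, so that I can gracefully consume XML data that does not conform to the hexadecimal character restrictions placed on XML?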

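Note:

  • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding in the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
  • The removal of invalid hexadecimal characters should only remove hexadecimal-encoded values, as you can often find href values in data that happen to contain a string that would match a hexadecimal character.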

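Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but I want to be able to consume published data sources that contain hexadecimal characters that are invalid per the XML specification.

In .NET, if you have a Stream that represents the XML data source and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised because of the invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.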


It may not be perfect (emphasis added, since people keep missing this disclaimer), but here is what I've done in that case. You can adjust it to work with a stream.

    /// <summary>
    /// Removes control characters and other non-UTF-8 characters
    /// </summary>
    /// <param name="inString">The string to process</param>
    /// <returns>A string with no control characters or entities above 0x00FD</returns>
    public static string RemoveTroublesomeCharacters(string inString)
    {
        if (inString == null) return null;

        StringBuilder newString = new StringBuilder();
        char ch;

        for (int i = 0; i < inString.Length; i++)
        {
            ch = inString[i];
            // remove any characters outside the valid UTF-8 range as well as all control characters
            // except tabs and new lines
            //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
            //if using .NET version prior to 4, use above logic
            if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
            {
                newString.Append(ch);
            }
        }
        return newString.ToString();
    }

I like Eugene's whitelist concept. I needed to do a similar thing as the original poster, but I needed to support all Unicode characters, not just up to 0x00FD. The XML spec is:

Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In .NET, the internal representation of Unicode characters is only 16 bits, so we can't 'allow' 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However, if we allowed these surrogate code points in our whitelist, UTF-8 encoding our string might still produce valid XML in the end, as long as proper UTF-8 was produced from the surrogate pairs of UTF-16 characters in the .NET string. I haven't explored this though, so I went with the safer bet and didn't allow the surrogates in my whitelist.

The comments in Eugene's solution are misleading though. The characters we are excluding are not valid in XML, but they are perfectly valid Unicode code points. We are not removing 'non-UTF-8 characters'; we are removing characters that may not appear in well-formed XML documents.

    public static string XmlCharacterWhitelist( string in_string ) {
        if( in_string == null ) return null;

        StringBuilder sbOutput = new StringBuilder();
        char ch;

        for( int i = 0; i < in_string.Length; i++ ) {
            ch = in_string[i];
            if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
                ( ch >= 0xE000 && ch <= 0xFFFD ) ||
                ch == 0x0009 ||
                ch == 0x000A ||
                ch == 0x000D ) {
                sbOutput.Append( ch );
            }
        }
        return sbOutput.ToString();
    }
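
The surrogate question raised in this answer can be handled without giving up the whitelist. The following is a minimal sketch, assuming .NET 4 or later: a high surrogate is kept only when it is immediately followed by a low surrogate (such a pair always encodes a code point in #x10000-#x10FFFF, which the Char production allows), and lone surrogates are dropped. The names `XmlCharFilter` and `KeepValidXmlChars` are illustrative and do not come from any of the answers.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: keeps every character allowed by the XML 1.0 Char
    // production, including supplementary characters stored as surrogate pairs.
    public static string KeepValidXmlChars(string input)
    {
        if (input == null) return null;

        var output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                // Allowed character from the Basic Multilingual Plane.
                output.Append(ch);
            }
            else if (i + 1 < input.Length && char.IsSurrogatePair(ch, input[i + 1]))
            {
                // A well-formed pair encodes U+10000..U+10FFFF, which XML allows.
                output.Append(ch).Append(input[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            // Everything else (control characters, U+FFFE, U+FFFF,
            // lone surrogates) is dropped.
        }
        return output.ToString();
    }
}
```

For example, `XmlCharFilter.KeepValidXmlChars("a\u0000b")` drops the NUL character, while a string containing an emoji keeps both halves of its surrogate pair.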
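
    private static String removeNonUtf8CompliantCharacters( final String inString ) {
        if (null == inString ) return null;
        // Encode explicitly as UTF-8 instead of relying on the platform default charset
        byte[] byteArr = inString.getBytes( java.nio.charset.StandardCharsets.UTF_8 );
        for ( int i=0; i < byteArr.length; i++ ) {
            // Java bytes are signed, so mask to get the unsigned value 0..255
            int ch = byteArr[i] & 0xFF;
            // remove any characters outside the valid UTF-8 range as well as all control characters
            // except tabs and new lines
            if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
                byteArr[i]=' ';
            }
        }
        return new String( byteArr, java.nio.charset.StandardCharsets.UTF_8 );
    }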

Try this for PHP!

$goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);

The above solutions seem to be for removing invalid characters prior to converting to XML.

Use this code to remove invalid XML character references from an XML string, e.g. &#x1A;

    public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
    {
        string pattern = String.Empty;
        switch( XMLVersion )
        {
            case "1.0":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
                break;
            case "1.1":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
                break;
            default:
                throw new Exception( "Error: Invalid XML Version!" );
        }

        Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
        if( regex.IsMatch( Xml ) )
            Xml = regex.Replace( Xml, String.Empty );
        return Xml;
    }

http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
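
Hand-written patterns like the ones above are hard to verify by eye. As an alternative, here is a minimal sketch that matches any numeric character reference, parses its value, and keeps it only when the code point is allowed by the XML 1.0 Char production. The names `XmlCharRefCleaner`, `IsXml10CodePoint` and `RemoveInvalidCharRefs` are made up for this illustration, and the sketch only targets XML 1.0.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class XmlCharRefCleaner
{
    // Matches decimal (&#26;) and hexadecimal (&#x1A;) character references.
    private static readonly Regex CharRef =
        new Regex(@"&#(?:x(?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));");

    // XML 1.0 Char production, applied to a full code point.
    private static bool IsXml10CodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD ||
               (cp >= 0x20 && cp <= 0xD7FF) ||
               (cp >= 0xE000 && cp <= 0xFFFD) ||
               (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: removes references to code points XML 1.0 forbids.
    public static string RemoveInvalidCharRefs(string xml)
    {
        if (xml == null) return null;

        return CharRef.Replace(xml, m =>
        {
            int cp;
            bool parsed = m.Groups["hex"].Success
                ? int.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out cp)
                : int.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out cp);

            // Keep the reference only when it parsed and names an allowed code point.
            return parsed && IsXml10CodePoint(cp) ? m.Value : string.Empty;
        });
    }
}
```

Because it works on the raw text, this sketch would also alter strings that merely look like character references inside CDATA sections or comments, so it may need adjusting for feeds that rely on those.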

You can pass non-UTF characters with the following:

    string sFinalString  = "";
    string hex = "";
    foreach (char ch in UTFCHAR)
    {
        int tmp = ch;
        if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            sFinalString  += ch;
        }
        else
        {
            sFinalString  += "&#" + tmp+";";
        }
    }

Modernising dnewcome's answer, you could take a slightly simpler approach:

    public static string RemoveInvalidXmlChars(string input)
    {
        var isValid = new Predicate<char>(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D);

        return new string(Array.FindAll(input.ToCharArray(), isValid));
    }

or, with LINQ:

    public static string RemoveInvalidXmlChars(string input)
    {
        return new string(input.Where(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D).ToArray());
    }

I'd be interested to know how the performance of these methods compares and how they all compare to a black list approach using Buffer.BlockCopy.

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is available in Silverlight too. Here is a small sample:

    void Main() {
        string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    static string RemoveInvalidXmlChars(string text) {
        char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return new string(validXmlChars);
    }

    static bool IsValidXmlString(string text) {
        try {
            XmlConvert.VerifyXmlChars(text);
            return true;
        } catch {
            return false;
        }
    }

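Regex-based approach:

    public static string StripInvalidXmlCharacters(string str)
    {
        // The negated class must not match surrogates itself, otherwise the first
        // alternative would strip valid surrogate pairs before the lookarounds
        // get a chance to keep them. Only unpaired surrogates are removed.
        var invalidXmlCharactersRegex = new Regex(@"[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
        return invalidXmlCharactersRegex.Replace(str, "");
    }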

See my blogpost for more details

Here is dnewcome's answer in a custom StreamReader. It simply wraps a real stream reader and replaces the characters as they are read.

I only implemented a few methods to save myself time. I used this in conjunction with XDocument.Load and a file stream and only the Read(char[] buffer, int index, int count) method was called, so it worked like this. You may need to implement additional methods to get this to work for your application. I used this approach because it seems more efficient than the other answers. I also only implemented one of the constructors, you could obviously implement any of the StreamReader constructors that you need, since it is just a pass through.

I chose to replace the characters rather than removing them because it greatly simplifies the solution. In this way the length of the text stays the same, so there is no need to keep track of a separate index.

    public class InvalidXmlCharacterReplacingStreamReader : TextReader
    {
        private StreamReader implementingStreamReader;
        private char replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
        {
            implementingStreamReader = new StreamReader(stream);
            this.replacementCharacter = replacementCharacter;
        }

        public override void Close()
        {
            implementingStreamReader.Close();
        }

        public override ObjRef CreateObjRef(Type requestedType)
        {
            return implementingStreamReader.CreateObjRef(requestedType);
        }

        public void Dispose()
        {
            implementingStreamReader.Dispose();
        }

        public override bool Equals(object obj)
        {
            return implementingStreamReader.Equals(obj);
        }

        public override int GetHashCode()
        {
            return implementingStreamReader.GetHashCode();
        }

        public override object InitializeLifetimeService()
        {
            return implementingStreamReader.InitializeLifetimeService();
        }

        public override int Peek()
        {
            int ch = implementingStreamReader.Peek();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }

        public override int Read()
        {
            int ch = implementingStreamReader.Read();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = implementingStreamReader.Read(buffer, index, count);
            for (int i = index; i < readCount+index; i++)
            {
                char ch = buffer[i];
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    buffer[i] = replacementCharacter;
                }
            }
            return readCount;
        }

        public override Task<int> ReadAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override int ReadBlock(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override Task<int> ReadBlockAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override string ReadLine()
        {
            throw new NotImplementedException();
        }

        public override Task<string> ReadLineAsync()
        {
            throw new NotImplementedException();
        }

        public override string ReadToEnd()
        {
            throw new NotImplementedException();
        }

        public override Task<string> ReadToEndAsync()
        {
            throw new NotImplementedException();
        }

        public override string ToString()
        {
            return implementingStreamReader.ToString();
        }
    }

DRY implementation of this answer's solution (using a different constructor - feel free to use the one you need in your application):

    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
        {
            this._replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = base.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                char ch = buffer[i];
                if (IsInvalidChar(ch))
                {
                    buffer[i] = this._replacementCharacter;
                }
            }
            return readCount;
        }

        private static bool IsInvalidChar(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                   ch != 0x0009 &&
                   ch != 0x000A &&
                   ch != 0x000D;
        }
    }

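Use this function to remove invalid XML characters.

    public static string CleanInvalidXmlChars(string text)
    {
        // .NET regex \x escapes take exactly two hex digits, so \uXXXX is used instead.
        // Code points #x10000-#x10FFFF are stored as surrogate pairs in a .NET string:
        // well-formed pairs are kept, unpaired surrogates are removed.
        string re = @"[^\u0009\u000A\u000D\u0020-\uFFFD]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }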

A modified version of the original answer by Neolisk above.
Changes: if the \0 character is passed as the replacement, invalid characters are removed rather than replaced. It also makes use of the XmlConvert.IsXmlChar(char) method.

    /// <summary>
    /// Replaces invalid Xml characters from input file, NOTE: if replacement character is \0, then invalid Xml character is removed, instead of 1-for-1 replacement
    /// </summary>
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            while (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' != _replacementCharacter)
                    return _replacementCharacter;

                // Peek() does not advance the reader, so the invalid character
                // has to be consumed before peeking at the next one.
                base.Read();
                ch = base.Peek();
            }
            return ch;
        }

        public override int Read()
        {
            int ch;
            while ((ch = base.Read()) != -1 && IsInvalidChar(ch))
            {
                if ('\0' != _replacementCharacter)
                    return _replacementCharacter;

                // Replacement is \0: skip the invalid character and read the next one.
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = 0, ch;

            for (int i = 0; i < count && (ch = Read()) != -1; i++)
            {
                readCount++;
                buffer[index + i] = (char)ch;
            }

            return readCount;
        }

        private static bool IsInvalidChar(int ch)
        {
            return !XmlConvert.IsXmlChar((char)ch);
        }
    }

I created a slightly updated version of @Neolisk's answer, which supports the *Async functions and uses the .NET 4.0 XmlConvert.IsXmlChar function.

    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter) : base(stream)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            var ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            var ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            var readCount = base.Read(buffer, index, count);
            ReplaceInBuffer(buffer, index, readCount);
            return readCount;
        }

        public override async Task<int> ReadAsync(char[] buffer, int index, int count)
        {
            var readCount = await base.ReadAsync(buffer, index, count).ConfigureAwait(false);
            ReplaceInBuffer(buffer, index, readCount);
            return readCount;
        }

        private void ReplaceInBuffer(char[] buffer, int index, int readCount)
        {
            for (var i = index; i < readCount + index; i++)
            {
                var ch = buffer[i];
                if (IsInvalidChar(ch))
                {
                    buffer[i] = _replacementCharacter;
                }
            }
        }

        private static bool IsInvalidChar(int ch)
        {
            return IsInvalidChar((char)ch);
        }

        private static bool IsInvalidChar(char ch)
        {
            return !XmlConvert.IsXmlChar(ch);
        }
    }
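
To show how a replacing reader like the ones above plugs into XML parsing, here is a small usage sketch. So that it compiles on its own, it uses a trimmed-down stand-in class, `SanitizingTextReader`, which is hypothetical and not one of the classes from the answers; `SanitizingReaderDemo` and `LoadTitle` are likewise invented for the example. It feeds a fragment containing a raw U+001A to XDocument.Load through the wrapper.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical, trimmed-down stand-in for the readers above.
public sealed class SanitizingTextReader : TextReader
{
    private readonly TextReader _inner;
    private readonly char _replacement;

    public SanitizingTextReader(TextReader inner, char replacement)
    {
        _inner = inner;
        _replacement = replacement;
    }

    // Maps an invalid character to the replacement; passes -1 (end) through.
    private int Fix(int ch)
    {
        return ch != -1 && !XmlConvert.IsXmlChar((char)ch) ? _replacement : ch;
    }

    public override int Peek() { return Fix(_inner.Peek()); }

    public override int Read() { return Fix(_inner.Read()); }

    public override int Read(char[] buffer, int index, int count)
    {
        int read = _inner.Read(buffer, index, count);
        for (int i = index; i < index + read; i++)
        {
            if (!XmlConvert.IsXmlChar(buffer[i])) buffer[i] = _replacement;
        }
        return read;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class SanitizingReaderDemo
{
    public static string LoadTitle()
    {
        // A feed fragment containing a raw U+001A, which XML 1.0 forbids.
        byte[] bytes = Encoding.UTF8.GetBytes("<feed><title>Bad\u001Achar</title></feed>");

        using (var stream = new MemoryStream(bytes))
        using (var reader = new SanitizingTextReader(new StreamReader(stream, Encoding.UTF8), ' '))
        {
            XDocument doc = XDocument.Load(reader);
            return doc.Root.Element("title").Value; // "Bad char"
        }
    }
}
```

Two caveats apply to this sketch. Because the parser receives already-decoded characters from a TextReader, the encoding is whatever the StreamReader was given (UTF-8 here), not the one named in the XML declaration, so feeds in other encodings would need the right Encoding passed in. Also, XmlConvert.IsXmlChar returns false for surrogate code units, so characters outside the Basic Multilingual Plane are replaced as well.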