// Replace invalid characters with empty strings.
Regex.Replace(inputString, @"[^\w\.@-]", "");
The regular expression pattern [^\w.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced by String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add those characters to the character class in the regular expression pattern. For example, the regular expression pattern [^\w.@-\%] also allows a percentage symbol and a backslash in an input string.
Here is a function to remove the characters from a specified XML string:
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
namespace XMLUtils
{
class Standards
{
/// <summary>
/// Strips non-printable ascii characters
/// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
/// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
/// </summary>
/// <param name="content">contents</param>
/// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
private void StripIllegalXMLChars(string tmpContents, string XMLVersion)
{
string pattern = String.Empty;
switch (XMLVersion)
{
case "1.0":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
break;
case "1.1":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
break;
default:
throw new Exception("Error: Invalid XML Version!");
}
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(tmpContents))
{
tmpContents = regex.Replace(tmpContents, String.Empty);
}
tmpContents = string.Empty;
}
}
}
As the way to remove invalid XML characters I suggest you to use XmlConvert.IsXmlChar method. It was added since .NET Framework 4 and is presented in Silverlight too. Here is the small sample:
Update:
It should be mentioned that the encoding operation produces a string with a length which is greater or equal than a length of a source string. It might be important when you store a encoded string in a database in a string column with length limitation and validate source string length in your app to fit data column limitation.
Here is an optimized version of the above method RemoveInvalidXmlChars which doesn't create a new array on every call, thus stressing the GC unnecessarily:
public static string RemoveInvalidXmlChars(string text)
{
if (text == null)
return text;
if (text.Length == 0)
return text;
// a bit complicated, but avoids memory usage if not necessary
StringBuilder result = null;
for (int i = 0; i < text.Length; i++)
{
var ch = text[i];
if (XmlConvert.IsXmlChar(ch))
{
result?.Append(ch);
}
else if (result == null)
{
result = new StringBuilder();
result.Append(text.Substring(0, i));
}
}
if (result == null)
return text; // no invalid xml chars detected - return original text
else
return result.ToString();
}
This simple method replace the invalid characters with the same value but accepted in the XML context.
To write string use XMLWriteStringWithoutIllegalCharacters(string UnfilteredString).
To read string use XMLReadStringWithoutIllegalCharacters(string FilteredString).