用 Java 为 XML 编码文本数据的最佳方法？

小开

最佳答案

Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.

小开

Use JAXP and forget about text handling it will be done for you automatically.

小开

Just use.

<![CDATA[ your text here ]]>

This will allow any characters except the ending

]]>

So you can include characters that would be illegal such as & and >. For example.

<element><![CDATA[ characters such as & and > are allowed ]]></element>

However, attributes will need to be escaped as CDATA blocks can not be used for them.

小开

As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.

小开

Note: Your question is about escaping, not encoding. Escaping is using <, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).

First of all, like everyone else said, use an XML library. XML looks simple but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#FF11; is 1)). Keeping XML human readable is a Sisyphus' task.

I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).

That said, if you use only UTF-8, to make things more readable you can consider this strategy:

If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't warp it.

I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.

小开

This has worked well for me to provide an escaped version of a text string:

public class XMLHelper {


/**
* Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "&lt;A &amp; B &gt;"
* .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there was
* no characters to protect, the original string is returned.
*
* @param originalUnprotectedString
*            original string which may contain characters either reserved in XML or with different representation
*            in different encodings (like 8859-1 and UFT-8)
* @return
*/
public static String protectSpecialCharacters(String originalUnprotectedString) {
if (originalUnprotectedString == null) {
return null;
}
boolean anyCharactersProtected = false;


StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < originalUnprotectedString.length(); i++) {
char ch = originalUnprotectedString.charAt(i);


boolean controlCharacter = ch < 32;
boolean unicodeButNotAscii = ch > 126;
boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';


if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
stringBuffer.append("&#" + (int) ch + ";");
anyCharactersProtected = true;
} else {
stringBuffer.append(ch);
}
}
if (anyCharactersProtected == false) {
return originalUnprotectedString;
}


return stringBuffer.toString();
}


}

小开

While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance says template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.

Consider this: XML was meant to be written by humans.

Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.

Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions, escapeXml(string) can be used like this:

<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>


<item>${fn:escapeXml(value)}</item>

小开

To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/

The class is this: org.apache.commons.lang3.StringEscapeUtils;

It has a method named "escapeXml", that will return an appropriately escaped String.

小开

The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0. It now no longer escapes Unicode characters greater than 0x7f.

This is a good thing, the old method was to be a bit to eager to escape entities that could just be inserted into a utf8 document.

The new escapers to be included in Google Guava 11.0 also seem promising: http://code.google.com/p/guava-libraries/issues/detail?id=799

小开

Try this:

String xmlEscapeText(String t) {
StringBuilder sb = new StringBuilder();
for(int i = 0; i < t.length(); i++){
char c = t.charAt(i);
switch(c){
case '<': sb.append("&lt;"); break;
case '>': sb.append("&gt;"); break;
case '\"': sb.append("&quot;"); break;
case '&': sb.append("&amp;"); break;
case '\'': sb.append("&apos;"); break;
default:
if(c>0x7e) {
sb.append("&#"+((int)c)+";");
}else
sb.append(c);
}
}
return sb.toString();
}

小开

StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.

To escape control characters with Apache commons-lang, use

NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))

小开

Here's an easy solution and it's great for encoding accented characters too!

String in = "Hi Lârry & Môe!";


StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
char c = in.charAt(i);
if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
out.append("&#" + (int) c + ";");
} else {
out.append(c);
}
}


System.out.printf("%s%n", out);

Outputs

Hi L&#226;rry &#38; M&#244;e!

小开

While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar the two functions to escape/unescape a simple value (attribute or tag, not full document) are not available in the standard XML libraries included with Java.

As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):

  public final static String ESCAPE_CHARS = "<>&\"\'";
public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
"&lt;"
, "&gt;"
, "&amp;"
, "&quot;"
, "&apos;"
}));


private static String UNICODE_NULL = "" + ((char)0x00); //null
private static String UNICODE_LOW =  "" + ((char)0x20); //space
private static String UNICODE_HIGH = "" + ((char)0x7f);


//should only be used for the content of an attribute or tag
public static String toEscaped(String content) {
String result = content;
    

if ((content != null) && (content.length() > 0)) {
boolean modified = false;
StringBuilder stringBuilder = new StringBuilder(content.length());
for (int i = 0, count = content.length(); i < count; ++i) {
String character = content.substring(i, i + 1);
int pos = ESCAPE_CHARS.indexOf(character);
if (pos > -1) {
stringBuilder.append(ESCAPE_STRINGS.get(pos));
modified = true;
}
else {
if (    (character.compareTo(UNICODE_LOW) > -1)
&& (character.compareTo(UNICODE_HIGH) < 1)
) {
stringBuilder.append(character);
}
else {
//Per URL reference below, Unicode null character is always restricted from XML
//URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
if (character.compareTo(UNICODE_NULL) != 0) {
stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
}
modified = true;
}
}
}
if (modified) {
result = stringBuilder.toString();
}
}
    

return result;
}

The above accommodates several different things:

avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible given the probability is the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned

At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)

小开

Try to encode the XML using Apache XML serializer

//Serialize DOM
OutputFormat format    = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial   = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());

小开

public String escapeXml(String s) {
return s.replaceAll("&", "&amp;").replaceAll(">", "&gt;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;").replaceAll("'", "&apos;");
}

小开

For those looking for the quickest-to-write solution: use methods from apache commons-lang:

StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past

Remember to include dependency:

<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version> <!--check current version! -->
</dependency>

小开

This question is eight years old and still not a fully correct answer! No, you should not have to import an entire third party API to do this simple task. Bad advice.

The following method will:

correctly handle characters outside the basic multilingual plane
escape characters required in XML
escape any non-ASCII characters, which is optional but common
replace illegal characters in XML 1.0 with the Unicode substitution character. There is no best option here - removing them is just as valid.

I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.

public static String encodeXML(CharSequence s) {
StringBuilder sb = new StringBuilder();
int len = s.length();
for (int i=0;i<len;i++) {
int c = s.charAt(i);
if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
c = ((c-0xd7c0)<<10) | (s.charAt(++i)&0x3ff);    // UTF16 decode
}
if (c < 0x80) {      // ASCII range: test most common case first
if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("&#xfffd;");   // Unicode replacement character
} else {
switch(c) {
case '&':  sb.append("&amp;"); break;
case '>':  sb.append("&gt;"); break;
case '<':  sb.append("&lt;"); break;
// Uncomment next two if encoding for an XML attribute
//                  case '\''  sb.append("&apos;"); break;
//                  case '\"'  sb.append("&quot;"); break;
// Uncomment next three if you prefer, but not required
//                  case '\n'  sb.append("&#10;"); break;
//                  case '\r'  sb.append("&#13;"); break;
//                  case '\t'  sb.append("&#9;"); break;


default:   sb.append((char)c);
}
}
} else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("&#xfffd;");   // Unicode replacement character
} else {
sb.append("&#x");
sb.append(Integer.toHexString(c));
sb.append(';');
}
}
return sb.toString();
}

Edit: for those who continue to insist it foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.

小开

You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.

小开

Just replace

 & with &amp;

And for other characters:

> with &gt;
< with &lt;
\" with &quot;
' with &apos;

小开

If you are looking for a library to get the job done, try:

Guava 26.0 documented here

return XmlEscapers.xmlContentEscaper().escape(text);

Note: There is also an xmlAttributeEscaper()
Apache Commons Text 1.4 documented here

StringEscapeUtils.escapeXml11(text)

Note: There is also an escapeXml10() method

小开

Here's what I found after searching everywhere looking for a solution:

Get the Jsoup library:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.12.1</version>
</dependency>

Then:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser


String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">


<SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
<m:GetQuotation>
<m:QuotationsName> MiscroSoft@G>>gle.com </m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''






Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)


println doc.toString()

Hope this helps someone

小开

I have created my wrapper here, hope it will helps a lot, Click here You can modify depends on your requirements