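How do I unescape HTML character entities in Java?

Basically, I want to decode a given HTML document and replace all special characters, such as "&nbsp;" -> " " and "&gt;" -> ">".

In .NET we can make use of HttpUtility.HtmlDecode.

What is the equivalent function in Java?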


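I have used Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

If you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table and then use Java code like this: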

import java.util.LinkedHashMap;
import java.util.Map;

// LinkedHashMap keeps insertion order, so "&amp;" is decoded last
// and "&amp;lt;" becomes "&lt;" instead of "<".
static final Map<String, String> html_specialchars_table = new LinkedHashMap<String, String>();
static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
        // replace() is literal; replaceAll() would treat the key as a regex
        s = s.replace(entry.getKey(), entry.getValue());
    }
    return s;
}
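
A caveat with chained replacements like the one above: the entries are applied one after another, so the outcome can depend on their order, and a string such as &amp;lt; may end up decoded twice. The following is a minimal sketch of a single-pass alternative that uses only the JDK. The class name SpecialCharsDecoder and its small table are assumptions made for illustration; they are not part of PHP or of any library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes &lt; &gt; &amp; in one left-to-right pass.
class SpecialCharsDecoder {
    private static final Map<String, String> TABLE = new HashMap<String, String>();
    static {
        TABLE.put("lt", "<");
        TABLE.put("gt", ">");
        TABLE.put("amp", "&");
    }

    // Matches only the three entities listed in TABLE.
    private static final Pattern ENTITY = Pattern.compile("&(lt|gt|amp);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(TABLE.get(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("1 &lt; 2 &amp;&amp; 3 &gt; 2")); // 1 < 2 && 3 > 2
        System.out.println(decode("&amp;lt;"));                      // &lt;
    }
}
```

Because each entity is consumed exactly once, text produced by a replacement is never scanned again, so the order of the table entries no longer matters.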

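Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils suggested by Kevin Hakanson did not work 100% for me; several entities such as &#145; (left single quote) were somehow translated into "222". I also tried org.jsoup and had the same problem.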
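
One possible explanation, offered as an assumption rather than a diagnosis of that answer's data: numeric references in the range 128 to 159, such as &#145;, denote C1 control code points in Unicode, but browsers following the HTML5 parsing rules treat them as Windows-1252 bytes, so &#145; is displayed as a left single quotation mark (U+2018). A decoder that maps the number straight to a code point produces an invisible control character instead. The sketch below remaps that range using the windows-1252 charset, assuming the runtime provides it (common JDK builds do); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper: remaps code points 0x80..0x9F the way browsers do.
class Cp1252Fixup {
    // Assumption: the JDK in use ships the windows-1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static int remap(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint; // outside the affected range
        }
        String decoded = new String(new byte[] {(byte) codePoint}, CP1252);
        char c = decoded.charAt(0);
        // Bytes that Windows-1252 leaves undefined decode to U+FFFD; keep the original value then
        return c == '\uFFFD' ? codePoint : c;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(remap(145))); // 2018
        System.out.println(Integer.toHexString(remap(233))); // e9
    }
}
```

A fix-up like this would be applied to the parsed number of a numeric reference before it is turned into a character.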

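I tried Apache Commons StringEscapeUtils.unescapeHtml3() in my project, but wasn't satisfied with its performance. It turns out that it does a lot of unnecessary work. For one, it allocates a StringWriter for every call, even if there is nothing to unescape in the string. I rewrote that code differently, and now it runs much faster. Whoever finds this on Google is welcome to use it.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can add more entries to the map if you need HTML 4.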

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // found escape
            if (input.charAt(i) == '#') {
                // numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    // skip values that are not Unicode code points, e.g. "&#-1;"
                    if (!Character.isValidCodePoint(entityValue)) {
                        i++;
                        continue;
                    }

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        // supplementary character: write both surrogates
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            }
            else {
                // named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null)
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }


    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"}, // non-breaking space
        {"\u00A1", "iexcl"}, // inverted exclamation mark
        {"\u00A2", "cent"}, // cent sign
        {"\u00A3", "pound"}, // pound sign
        {"\u00A4", "curren"}, // currency sign
        {"\u00A5", "yen"}, // yen sign = yuan sign
        {"\u00A6", "brvbar"}, // broken bar = broken vertical bar
        {"\u00A7", "sect"}, // section sign
        {"\u00A8", "uml"}, // diaeresis = spacing diaeresis
        {"\u00A9", "copy"}, // © - copyright sign
        {"\u00AA", "ordf"}, // feminine ordinal indicator
        {"\u00AB", "laquo"}, // left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"}, // not sign
        {"\u00AD", "shy"}, // soft hyphen = discretionary hyphen
        {"\u00AE", "reg"}, // ® - registered trademark sign
        {"\u00AF", "macr"}, // macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"}, // degree sign
        {"\u00B1", "plusmn"}, // plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"}, // superscript two = superscript digit two = squared
        {"\u00B3", "sup3"}, // superscript three = superscript digit three = cubed
        {"\u00B4", "acute"}, // acute accent = spacing acute
        {"\u00B5", "micro"}, // micro sign
        {"\u00B6", "para"}, // pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"}, // cedilla = spacing cedilla
        {"\u00B9", "sup1"}, // superscript one = superscript digit one
        {"\u00BA", "ordm"}, // masculine ordinal indicator
        {"\u00BB", "raquo"}, // right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
        {"\u00C2", "Acirc"}, // Â - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
        {"\u00C4", "Auml"}, // Ä - uppercase A, umlaut
        {"\u00C5", "Aring"}, // Å - uppercase A, ring
        {"\u00C6", "AElig"}, // Æ - uppercase AE
        {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
        {"\u00CA", "Ecirc"}, // Ê - uppercase E, circumflex accent
        {"\u00CB", "Euml"}, // Ë - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
        {"\u00CE", "Icirc"}, // Î - uppercase I, circumflex accent
        {"\u00CF", "Iuml"}, // Ï - uppercase I, umlaut
        {"\u00D0", "ETH"}, // Ð - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
        {"\u00D4", "Ocirc"}, // Ô - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
        {"\u00D6", "Ouml"}, // Ö - uppercase O, umlaut
        {"\u00D7", "times"}, // multiplication sign
        {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
        {"\u00DB", "Ucirc"}, // Û - uppercase U, circumflex accent
        {"\u00DC", "Uuml"}, // Ü - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
        {"\u00DE", "THORN"}, // Þ - uppercase THORN, Icelandic
        {"\u00DF", "szlig"}, // ß - lowercase sharp s, German
        {"\u00E0", "agrave"}, // à - lowercase a, grave accent
        {"\u00E1", "aacute"}, // á - lowercase a, acute accent
        {"\u00E2", "acirc"}, // â - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // ã - lowercase a, tilde
        {"\u00E4", "auml"}, // ä - lowercase a, umlaut
        {"\u00E5", "aring"}, // å - lowercase a, ring
        {"\u00E6", "aelig"}, // æ - lowercase ae
        {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
        {"\u00E8", "egrave"}, // è - lowercase e, grave accent
        {"\u00E9", "eacute"}, // é - lowercase e, acute accent
        {"\u00EA", "ecirc"}, // ê - lowercase e, circumflex accent
        {"\u00EB", "euml"}, // ë - lowercase e, umlaut
        {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
        {"\u00ED", "iacute"}, // í - lowercase i, acute accent
        {"\u00EE", "icirc"}, // î - lowercase i, circumflex accent
        {"\u00EF", "iuml"}, // ï - lowercase i, umlaut
        {"\u00F0", "eth"}, // ð - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
        {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
        {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
        {"\u00F4", "ocirc"}, // ô - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // õ - lowercase o, tilde
        {"\u00F6", "ouml"}, // ö - lowercase o, umlaut
        {"\u00F7", "divide"}, // division sign
        {"\u00F8", "oslash"}, // ø - lowercase o, slash
        {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
        {"\u00FB", "ucirc"}, // û - lowercase u, circumflex accent
        {"\u00FC", "uuml"}, // ü - lowercase u, umlaut
        {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
        {"\u00FE", "thorn"}, // þ - lowercase thorn, Icelandic
        {"\u00FF", "yuml"}, // ÿ - lowercase y, umlaut
    };


    // shortest and longest text allowed between '&' and ';'
    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    // entity name -> replacement text, built from ESCAPES
    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES)
            lookupMap.put(seq[1].toString(), seq[0]);
    }


}
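
Note that with MAX_ESCAPE = 6 the method above skips anything longer than six characters between & and ;, so a numeric reference to a supplementary character such as &#128512; or &#x1F600; is left untouched. As a rough sketch of how such references could be decoded separately, using only the JDK: the class name, the regular expression, and the digit limits below are assumptions made for illustration, and named entities are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes decimal and hexadecimal numeric references only.
class NumericRefDecoder {
    // group 1 = hex digits, group 2 = decimal digits
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            int value = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            out.append(input, last, m.start());
            if (Character.isValidCodePoint(value)) {
                out.appendCodePoint(value); // emits a surrogate pair when needed
            } else {
                out.append(m.group());      // leave out-of-range references as they are
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Named entities such as &amp; pass through unchanged.
        System.out.println(decode("caf&#233; &#x1F600; &amp;"));
    }
}
```

The digit limits in the pattern keep the parsed value inside the int range, so parseInt cannot overflow here.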

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

In my case I used the replace method, handling each entity with its own replace call. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

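For me, this method worked very well.

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.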

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();


// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

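You also get a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is open source and MIT licensed.

A very simple but inefficient solution without any external library is: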

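import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String unescapeHtml3( String str ) {
    try {
        // let the Swing HTML parser resolve the entities
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read( new StringReader( "<html><body>" + str ), doc, 0 );
        return doc.getText( 1, doc.getLength() );
    } catch( Exception ex ) {
        // fall back to the unmodified input if parsing fails
        return str;
    }
}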

This should be used only if you have a small number of strings to decode.

This did the job for me,

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml(encodedXML);

or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml4(encodedXML);

I guess it is always better to use lang3, for obvious reasons. Hope this helps :)

The most reliable way is

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to trim the surrounding whitespace:

cleanedString = cleanedString.trim();

This ensures that whitespace introduced by copying and pasting into web forms does not get persisted in the DB.
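
One thing to keep in mind here: unescaping &nbsp; produces U+00A0, and String.trim() only strips characters up to U+0020, so non-breaking spaces at either end survive it. Below is a small sketch of one way to handle that, assuming you are happy to treat every non-breaking space (including those inside the text) as an ordinary space. The helper name is made up for this example.

```java
// Hypothetical helper: trims ordinary whitespace and non-breaking spaces.
class WhitespaceCleanup {
    static String trimWithNbsp(String s) {
        // Turn U+00A0 into a plain space first, then let trim() do the rest
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        // e.g. the result of unescaping "&nbsp; John Smith &nbsp;"
        String unescaped = "\u00A0 John Smith \u00A0";
        System.out.println("[" + unescaped.trim() + "]");        // non-breaking spaces remain
        System.out.println("[" + trimWithNbsp(unescaped) + "]"); // [John Smith]
    }
}
```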

Spring Framework HtmlUtils

If you are already using the Spring Framework, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;


...


String result = htmlUnescape(source);


StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML and XML.

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html