How do you make strings "XML safe"?

I am responding to an AJAX call by sending it an XML document through PHP echos. In order to form this XML document, I loop through the records of a database. The problem is that the database includes records that have '<' symbols in them. So naturally, the browser throws an error at that particular spot. How can this be fixed?

116004 次浏览

By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.

Take also into consideration that that you must respect the encoding you define for the XML document (by default UTF-8).

1) You can wrap your text as CDATA like this:

<mytag>
<![CDATA[Your text goes here. Btw: 5<6 and 6>5]]>
</mytag>

see http://www.w3schools.com/xml/xml_cdata.asp

2) As already someone said: Escape those chars. E.g. like so:

5&lt;6 and 6&gt;5

If at all possible, its always a good idea to create your XML using the XML classes rather than string manipulation - one of the benefits being that the classes will automatically escape characters as needed.

Try this:

$str = htmlentities($str,ENT_QUOTES,'UTF-8');

So, after filtering your data using htmlentities() function, you can use the data in XML tag like:

<mytag>$str</mytag>

Adding this in case it helps someone.

As I am working with Japanese characters, encoding has also been set appropriately. However, from time to time, I find that htmlentities and htmlspecialchars are not sufficient.

Some user inputs contain special characters that are not stripped by the above functions. In those cases I have to do this:

preg_replace('/[\x00-\x1f]/','',htmlspecialchars($string))

This will also remove certain xml-unsafe control characters like Null character or EOT. You can use this table to determine which characters you wish to omit.

Since PHP 5.4 you can use:

htmlspecialchars($string, ENT_XML1);

You should specify the encoding, such as:

htmlspecialchars($string, ENT_XML1, 'UTF-8');

Update

Note that the above will only convert:

  • & to &amp;
  • < to &lt;
  • > to &gt;

If you want to escape text for use in an attribute enclosed in double quotes:

htmlspecialchars($string, ENT_XML1 | ENT_COMPAT, 'UTF-8');

will convert " to &quot; in addition to &, < and >.


And if your attributes are enclosed in single quotes:

htmlspecialchars($string, ENT_XML1 | ENT_QUOTES, 'UTF-8');

will convert ' to &apos; in addition to &, <, > and ".

(Of course you can use this even outside of attributes).


See the manual entry for htmlspecialchars.

I prefer the way Golang does quote escaping for XML (and a few extras like newline escaping, and escaping some other characters), so I have ported its XML escape function to PHP below

function isInCharacterRange(int $r): bool {
return $r == 0x09 ||
$r == 0x0A ||
$r == 0x0D ||
$r >= 0x20 && $r <= 0xDF77 ||
$r >= 0xE000 && $r <= 0xFFFD ||
$r >= 0x10000 && $r <= 0x10FFFF;
}


function xml(string $s, bool $escapeNewline = true): string {
$w = '';


$Last = 0;
$l = strlen($s);
$i = 0;


while ($i < $l) {
$r = mb_substr(substr($s, $i), 0, 1);
$Width = strlen($r);
$i += $Width;
switch ($r) {
case '"':
$esc = '&#34;';
break;
case "'":
$esc = '&#39;';
break;
case '&':
$esc = '&amp;';
break;
case '<':
$esc = '&lt;';
break;
case '>':
$esc = '&gt;';
break;
case "\t":
$esc = '&#x9;';
break;
case "\n":
if (!$escapeNewline) {
continue 2;
}
$esc = '&#xA;';
break;
case "\r":
$esc = '&#xD;';
break;
default:
if (!isInCharacterRange(mb_ord($r)) || (mb_ord($r) === 0xFFFD && $Width === 1)) {
$esc = "\u{FFFD}";
break;
}


continue 2;
}
$w .= substr($s, $Last, $i - $Last - $Width) . $esc;
$Last = $i;
}
$w .= substr($s, $Last);
return $w;
}

Note you'll need at least PHP7.2 because of the mb_ord usage, or you'll have to swap it out for another polyfill, but these functions are working great for us!

For anyone curious, here is the relevant Go source https://golang.org/src/encoding/xml/xml.go?s=44219:44263#L1887