You can't anticipate every odd form of malformed markup that some browser somewhere will accept and that will slip past a blacklist, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.
Instead attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).
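As a rough illustration of the whitelist step, here is a minimal sketch that works on an already-parsed tree. The tree shape (`tag`, `attrs`, `children`, with text as plain strings), the name `filterNode`, and the two whitelists are all assumptions made for this example; a real parser would supply the tree. It does not check URL values, which still need their own whitelist.

```javascript
// Hypothetical whitelists; keep them as small as the application allows.
const ALLOWED_TAGS = new Set(['a', 'b', 'em', 'p']);
const ALLOWED_ATTRS = new Set(['href', 'title']);

// Assumed node shape: { tag: 'p', attrs: { title: '...' }, children: [...] }.
// Text nodes are plain strings.
function filterNode(node) {
  if (typeof node === 'string') return node; // text passes through unchanged
  const tag = node.tag.toLowerCase();
  // Anything not on the whitelist is dropped together with its contents.
  if (!ALLOWED_TAGS.has(tag)) return null;
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    if (ALLOWED_ATTRS.has(name.toLowerCase())) attrs[name.toLowerCase()] = value;
  }
  const children = (node.children || [])
    .map(filterNode)
    .filter((child) => child !== null);
  return { tag, attrs, children };
}
```

Dropping an unknown element together with its subtree is the conservative choice here; keeping the children of unknown elements is friendlier to users but needs more care with elements such as script and style.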
If the input is well-formed XHTML the first part of the above is much easier.
As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?
Update 2016: There is now a Google Closure package based on the Caja sanitizer.
It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.
Shameless plug: see caja/plugin/html-sanitizer.js for a client side html sanitizer that has been thoroughly reviewed.
It is white-listed, not black-listed, but the whitelists are configurable as per CajaWhitelists.
If you want to remove all tags, then do the following:
var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';
var tagOrComment = new RegExp(
'<(?:'
// Comment body.
+ '!--(?:(?:-*[^->])*--+|-?)'
// Special "raw text" elements whose content should be elided.
+ '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
+ '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
// Regular name
+ '|/?[a-z]'
+ tagBody
+ ')>',
'gi');
function removeTags(html) {
var oldHtml;
do {
oldHtml = html;
html = html.replace(tagOrComment, '');
} while (html !== oldHtml);
return html.replace(/</g, '&lt;');
}
People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.
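If the goal is only to display untrusted input as text, the escaping can be done on the string itself, so the input never reaches the HTML parser. The sketch below is one way to do that; the name `escapeHtml` is made up for this example.

```javascript
// Hypothetical helper: escapes the five characters that matter in HTML
// text and in quoted attribute values. No DOM node is ever created.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

With this, the img example above comes out as inert text rather than an element, so its onerror handler has nothing to run in.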
Never trust the client. If you're writing a server application, assume that the client will always submit unsanitized, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitization in server code, which you know (to a reasonable degree) won't be tampered with. Perhaps you could use a server-side web application as a proxy for your client-side code, which fetches from the third party and sanitizes the content before sending it to the client?
[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.
The following is not tested, though I have been using TreeWalkers for some time now and they are one of the most undervalued parts of JavaScript. Here is a list of the node types you can crawl; I usually use SHOW_ELEMENT or SHOW_TEXT.
function xhtml_cleaner(id)
{
var e = document.getElementById(id);
var f = document.createDocumentFragment();
f.appendChild(e.cloneNode(true));
var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);
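var scripts = [];
while (walker.nextNode())
{
var c = walker.currentNode;
if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
if (c.hasAttribute('style')) {c.removeAttribute('style');}
// Collect scripts instead of removing them here: detaching the walker's current node mid-walk ends the traversal early.
if (c.nodeName.toLowerCase()=='script') {scripts.push(c);}
}
for (var i = 0; i < scripts.length; i++) {element_del(scripts[i]);}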
alert(new XMLSerializer().serializeToString(f));
return f;
}
function element_del(element_id)
{
if (document.getElementById(element_id))
{
document.getElementById(element_id).parentNode.removeChild(document.getElementById(element_id));
}
else if (element_id)
{
element_id.parentNode.removeChild(element_id);
}
else
{
alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
}
}
The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.
For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.
Using the sanitizer takes two steps:
Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page.
Call html_sanitize(...).
The worker script only needs to follow those instructions:
importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'
var urlTransformer, nameIdClassTransformer;
// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };
// when we receive some HTML
self.onmessage = function(event) {
// sanitize, then send the result back
postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};
(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)
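For the page side, something like the following could pair each reply with its request. This is only a sketch: `createSanitizerClient` and the file name `sanitizer-worker.js` are made up for this example, and it relies on the worker script above answering messages in the order it receives them.

```javascript
// Hypothetical page-side helper. Because the worker replies in order,
// a simple first-in, first-out queue is enough to match replies to callers.
function createSanitizerClient(worker) {
  var pending = [];
  worker.onmessage = function (event) {
    var callback = pending.shift();
    if (callback) { callback(event.data); }
  };
  return function sanitizeAsync(html, callback) {
    pending.push(callback);
    worker.postMessage(html);
  };
}

// In a browser (the file name is an assumption):
// var sanitizeAsync = createSanitizerClient(new Worker('sanitizer-worker.js'));
// sanitizeAsync('<b>hi</b><script>evil()</script>', function (clean) {
//   console.log(clean);
// });
```

Taking the worker as a parameter keeps the helper independent of how the worker is created, so the same code works with a real Worker or with a polyfilled one.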
Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.
NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing. (Note: the table seems to say it won't work in Opera Mini, but I just tried it, and it worked.)
The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let it parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.
The whitelists shown here are just examples. What's best to whitelist would depend on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, that can be accommodated by this method, though not by this example code.
var tagWhitelist_ = {
'A': true,
'B': true,
'BODY': true,
'BR': true,
'DIV': true,
'EM': true,
'HR': true,
'I': true,
'IMG': true,
'P': true,
'SPAN': true,
'STRONG': true
};
var attributeWhitelist_ = {
'href': true,
'src': true
};
function sanitizeHtml(input) {
var iframe = document.createElement('iframe');
if (iframe['sandbox'] === undefined) {
alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
return '';
}
iframe['sandbox'] = 'allow-same-origin';
iframe.style.display = 'none';
document.body.appendChild(iframe); // necessary so the iframe contains a document
iframe.contentDocument.body.innerHTML = input;
function makeSanitizedCopy(node) {
if (node.nodeType == Node.TEXT_NODE) {
var newNode = node.cloneNode(true);
} else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
newNode = iframe.contentDocument.createElement(node.tagName);
for (var i = 0; i < node.attributes.length; i++) {
var attr = node.attributes[i];
if (attributeWhitelist_[attr.name]) {
newNode.setAttribute(attr.name, attr.value);
}
}
for (i = 0; i < node.childNodes.length; i++) {
var subCopy = makeSanitizedCopy(node.childNodes[i]);
newNode.appendChild(subCopy); // appendChild takes a single argument
}
} else {
newNode = document.createDocumentFragment();
}
return newNode;
};
var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
document.body.removeChild(iframe);
return resultElement.innerHTML;
};
SECURITY HOLE: Commenter @Explosion points out that an href attribute can contain JavaScript, like <a href="javascript:alert('Oops')">. It should be possible to catch that and remove it in the sanitization code, but the above code has not (yet) been updated to do that.
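One possible way to close that hole is to check URL-valued attributes against a whitelist of schemes before copying them. The sketch below is not part of the code above: `isSafeUrl`, the scheme list, and the placeholder base URL are all assumptions, and it needs the URL constructor, which older browsers such as IE 11 lack.

```javascript
// Hypothetical whitelist of schemes; adjust to what the application needs.
var SAFE_PROTOCOLS = ['http:', 'https:', 'mailto:'];

function isSafeUrl(value) {
  try {
    // The base is only a placeholder so that relative URLs parse;
    // they inherit its https: scheme and are therefore accepted.
    var url = new URL(value, 'https://placeholder.invalid/');
    // The parser lowercases the scheme and strips leading whitespace,
    // tabs and newlines, so disguised "javascript:" values are caught.
    return SAFE_PROTOCOLS.indexOf(url.protocol) !== -1;
  } catch (e) {
    return false; // values that do not parse at all are rejected
  }
}
```

In the attribute-copying loop, the value of href or src would then be copied only when `isSafeUrl(attr.value)` returns true.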
Note that I'm disallowing style attributes and tags in this example. If you allowed them, you'd probably want to parse the CSS and make sure it's safe for your purposes.
I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.
The Google Caja library suggested above was way too complex to configure and include in my project for a web application (so, running in the browser). What I resorted to instead, since we already use the CKEditor component, is to use its built-in HTML sanitizing and whitelisting function, which is far easier to configure. So, you can load a CKEditor instance in a hidden iframe and do something like:
Now, granted, if you're not using CKEditor in your project this may be a bit of overkill, since the component itself is around half a megabyte (minified), but if you have the sources, maybe you can isolate the code doing the whitelisting (CKEDITOR.htmlParser?) and make it much smaller.
So, it's 2016, and I think many of us are using npm modules in our code now. sanitize-html seems like the leading option on npm, though there are others.
Other answers to this question provide great input in how to roll your own, but this is a tricky enough problem that well-tested community solutions are probably the best answer.
Run this on the command line to install:
npm install --save sanitize-html
ES5:
var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);
ES6:
import sanitizeHtml from 'sanitize-html';
// ...
let sanitized = sanitizeHtml(htmlInput);
We wrote a "web-only" (i.e. "requires a browser") open source library for this, https://github.com/jitbit/HtmlSanitizer that removes all tags/attributes/styles except the "whitelisted" ones.
Usage:
var input = HtmlSanitizer.SanitizeHtml("<script> Alert('xss!'); </scr"+"ipt>");
P.S. Works much faster than a "pure JavaScript" solution since it uses the browser to parse and manipulate DOM. If you're interested in a "pure JS" solution please try https://github.com/punkave/sanitize-html (not affiliated)
Instead of using regex, I thought of a way using native DOM APIs. You parse the HTML into a document, walk every element in it, and remove any element or attribute that is not on a whitelist. Each attribute rule is either a plain string naming an allowed attribute, or an object that adds a regex to validate the value and a list of tags the attribute is allowed on.
const sanitize = (html, tags = undefined, attributes = undefined) => {
var attributes = attributes || [
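// Whitelist URL schemes: a "javascript:" blacklist is bypassed by mixed case or leading whitespace.
{ attribute: "src", tags: "*", regex: /^(?:https?:)?\/\//i },
{ attribute: "href", tags: "*", regex: /^(?:https?:|mailto:|\/|#)/i },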
{ attribute: "width", tags: "*", regex: /^[0-9]+$/ },
{ attribute: "height", tags: "*", regex: /^[0-9]+$/ },
{ attribute: "id", tags: "*", regex: /^[a-zA-Z]+$/ },
{ attribute: "class", tags: "*", regex: /^[a-zA-Z ]+$/ },
{ attribute: "value", tags: ["INPUT", "TEXTAREA"], regex: /^.+$/ },
{ attribute: "checked", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
{
attribute: "placeholder",
tags: ["INPUT", "TEXTAREA"],
regex: /^.+$/,
},
{
attribute: "alt",
tags: ["IMG", "AREA", "INPUT"],
// "^" and "$" match the beginning and end of the value
regex: /^[0-9a-zA-Z]+$/,
},
{ attribute: "autofocus", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
{ attribute: "for", tags: ["LABEL", "OUTPUT"], regex: /^[a-zA-Z0-9]+$/ },
]
var tags = tags || [
"I",
"P",
"B",
"BODY",
"HTML",
"DEL",
"INS",
"STRONG",
"SMALL",
"A",
"IMG",
"CITE",
"FIGCAPTION",
"ASIDE",
"ARTICLE",
"SUMMARY",
"DETAILS",
"NAV",
"TD",
"TH",
"TABLE",
"THEAD",
"TBODY",
"NAV",
"SPAN",
"BR",
"CODE",
"PRE",
"BLOCKQUOTE",
"EM",
"HR",
"H1",
"H2",
"H3",
"H4",
"H5",
"H6",
"DIV",
"MAIN",
"HEADER",
"FOOTER",
"SELECT",
"COL",
"AREA",
"ADDRESS",
"ABBR",
"BDI",
"BDO",
]
attributes = attributes.map((el) => {
if (typeof el === "string") {
return { attribute: el, tags: "*", regex: /^.+$/ }
}
let output = el
if (!el.hasOwnProperty("tags")) {
output.tags = "*"
}
if (!el.hasOwnProperty("regex")) {
output.regex = /^.+$/
}
return output
})
var el = new DOMParser().parseFromString(html, "text/html")
var elements = el.querySelectorAll("*")
for (let i = 0; i < elements.length; i++) {
const current = elements[i]
let attr_list = get_attributes(current)
for (let j = 0; j < attr_list.length; j++) {
const attribute = attr_list[j]
if (!attribute_matches(current, attribute)) {
current.removeAttribute(attr_list[j])
}
}
if (!tags.includes(current.tagName)) {
current.remove()
}
}
return el.documentElement.innerHTML
function attribute_matches(element, attribute) {
let output = attributes.filter((attr) => {
let returnval =
attr.attribute === attribute &&
(attr.tags === "*" || attr.tags.includes(element.tagName)) &&
attr.regex.test(element.getAttribute(attribute))
return returnval
})
return output.length > 0
}
function get_attributes(element) {
for (
var i = 0, atts = element.attributes, n = atts.length, arr = [];
i < n;
i++
) {
arr.push(atts[i].nodeName)
}
return arr
}
}
<h1>Sanitize HTML client side</h1>
<textarea id='input' placeholder="Unsanitized HTML">
<!-- This removes both the src and onerror attributes because src is not a valid url. -->
<img src="error" onerror="alert('XSS')">
<div id="something_harmless" onload="alert('More XSS')">
<b>Bold text!</b> and <em>Italic text!</em>, some more text. <del>Deleted text!</del>
</div>
<script>
alert("This would be XSS");
</script>
</textarea>
<textarea id='output' placeholder="Sanitized HTML will appear here" readonly></textarea>
<script>
document.querySelector("#input").onkeyup = () => {
document.querySelector("#output").value = sanitize(document.querySelector("#input").value);
}
</script>