views: 80
answers: 2

My goal is to take HTML entered by an end user, remove certain unsafe tags like <script>, and add it to the document. Does anybody know of a good JavaScript library for sanitizing HTML?

I searched around and found a few online, including John Resig's HTML parser, Erik Arvidsson's simple HTML parser, and Google's Caja Sanitizer, but I haven't been able to find much information about whether people have had good experiences using these libraries, and I'm worried that they aren't really robust enough to handle arbitrary HTML. Would I be better off just sending the HTML to my Java server for sanitization?

+2  A: 

You can parse HTML with jQuery, but I'm pretty sure any blacklist-based (i.e. filtering-out) approach to sanitizing is going to fail; you probably need a whitelist ("filtering in") approach instead, and ultimately you don't want to be relying on JavaScript for security anyway. In any case, for reference, you can use jQuery for DOM parsing like this:

var htmlS = "<html>etc.etc.";       // untrusted markup string
var $dom = $("<div>").html(htmlS);  // parse into a detached element
$dom.find("script").remove();       // removes nested <script> elements too
/* DON'T RELY ON THIS FOR SECURITY */
Graphain
Good point. In fact, you probably don't even *need* the jQuery wrapper, per se, but it would make things easier. Just let the browser itself handle the parsing, and then use the DOM methods available to you to do whatever you want.
Matchu
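For illustration, a minimal sketch of the plain-DOM approach Matchu describes; untrustedHtml is an assumed variable holding the user's markup, and (per the rest of this thread) removing script elements alone is nowhere near sufficient:

var container = document.createElement("div");
container.innerHTML = untrustedHtml; // the browser itself parses the markup
// getElementsByTagName returns a live list, so removing index 0 repeatedly drains it
var scripts = container.getElementsByTagName("script");
while (scripts.length > 0) {
    scripts[0].parentNode.removeChild(scripts[0]);
}
/* AGAIN: DON'T RELY ON THIS ALONE FOR SECURITY */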
Mind explaining how?
icktoofay
@icktoofay Yep, edited; my bad.
Graphain
Look at this web page for all the crazy ways you are vulnerable to XSS: http://ha.ckers.org/xss.html. Unfortunately, just removing the script tags is not even close to good enough...
gerdemb
@gerdemb - definitely, any HTML sanitization should be implemented as a whitelist instead of a blacklist.
Matchu
Simply parsing with jQuery or with an HTML parser doesn't even begin to address the complexity of filtering a document for untrusted code. You can't just remove script elements. See the XSS cheat sheet that gerdemb posted above. Just for example, consider: script elements, the onload attribute, the onclick attribute, on<whatever> attributes, meta elements, javascript: URLs, obfuscated javascript: URLs, object elements, applet elements, url() in CSS, and much, much more. The example in this answer is harmful in its inadequacy. Even a whitelist-based approach would have to filter URLs in elements like <a>.
thomasrutter
@thomasrutter absolutely agree
Graphain
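To make thomasrutter's point concrete, here is a hedged sketch of what a whitelist ("filter in") pass over a parsed tree might look like. The allowed tag and attribute sets and the URL check are illustrative assumptions, not a vetted policy, and a real sanitizer has to cover much more (CSS, encodings, embedded content):

var ALLOWED_TAGS  = { B: true, I: true, EM: true, STRONG: true, P: true, A: true };
var ALLOWED_ATTRS = { href: true, title: true };

function isSafeUrl(url) {
    // Allow only absolute http(s) URLs and page-relative links; this
    // rejects javascript:, data:, vbscript: and similar schemes.
    return /^https?:\/\//i.test(url) || /^[\/#]/.test(url);
}

function sanitizeNode(node) {
    // Copy childNodes first, because the loop mutates the tree
    var children = Array.prototype.slice.call(node.childNodes);
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 1) { // element node
            if (!ALLOWED_TAGS.hasOwnProperty(child.tagName)) {
                node.removeChild(child); // drop disallowed elements outright
                continue;
            }
            // Strip every attribute not explicitly allowed; this also
            // kills onclick, onload, on<whatever>, style, etc.
            var attrs = Array.prototype.slice.call(child.attributes);
            for (var j = 0; j < attrs.length; j++) {
                var name = attrs[j].name.toLowerCase();
                if (!ALLOWED_ATTRS.hasOwnProperty(name) ||
                        (name === "href" && !isSafeUrl(attrs[j].value))) {
                    child.removeAttribute(attrs[j].name);
                }
            }
            sanitizeNode(child); // recurse into kept elements
        } else if (child.nodeType !== 3) {
            node.removeChild(child); // drop comments and other non-text nodes
        }
    }
}

Note that dropping a disallowed element also drops its children; that's a deliberately blunt choice for this sketch (unwrapping the element and keeping its text is a friendlier alternative).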
+2  A: 

Would I be better off just sending the HTML to my Java server for sanitization?

Yes.

Filtering "unsafe" input must be done server-side. There is no other way to do it. It's not possible to do filtering client-side because the "client-side" could be a web browser or it could just as easily be a bot with a script.

thomasrutter
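For completeness, a sketch of the server-side shape of this answer, written with Node's http module only to stay in the thread's language; the OP's stack is Java, where a server-side sanitizer library would fill the sanitizeHtml role. The sanitizeHtml below is a deliberately crude placeholder that escapes everything rather than whitelisting:

var http = require("http");

// Placeholder sanitizer: escaping everything is the safe default.
// A real implementation whitelists tags, attributes and URL schemes.
function sanitizeHtml(html) {
    return html.replace(/&/g, "&amp;")
               .replace(/</g, "&lt;")
               .replace(/>/g, "&gt;");
}

http.createServer(function (req, res) {
    var body = "";
    req.on("data", function (chunk) { body += chunk; });
    req.on("end", function () {
        var clean = sanitizeHtml(body); // sanitize on arrival, before storing or echoing
        res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
        res.end(clean);
    });
}).listen(8080);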
Filtering unsafe *input*, yes, that must be done on the server, because a client that skips the filtering it's supposed to do can harm other users. This case, however, is filtering unsafe *output*, and a client that doesn't filter will only harm itself. Therefore, doing this with JavaScript is fine.
Bart van Heukelom
@Bart "a client that doesn't filter will only harm itself. Therefore, doing this with JavaScript is fine" <- this is not entirely true, as one compromised user might have access that affects other users
Graphain
A compromised user can do all sorts of bad things. If you filter out script tags on the server, the compromised client will just put them back when rendering. Or, more likely, it won't bother with that inconvenience and will just run the evil code directly.
Bart van Heukelom
@Bart van Heukelom your first comment above is true if the code never gets shared with other users or the server and is simply inserted into the current page using JavaScript, which, on re-reading the original question, I realise could be what the OP meant.
thomasrutter
It's even true if it's shared with others, as long as it's properly documented that it's unchecked data (but of course, *that* isn't always the case).
Bart van Heukelom
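To illustrate Bart's scenario, filtering untrusted markup client-side at output time, here is how the whitelist sketch from earlier in the thread might be applied just before insertion into the current page. "preview" is an assumed element id, and note that parsing via innerHTML can still trigger fetches (e.g. image loads) before the filter runs:

var root = document.createElement("div");
root.innerHTML = untrustedHtml;  // parse while detached from the document
sanitizeNode(root);              // whitelist-filter before attaching
document.getElementById("preview").appendChild(root);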