views: 80
answers: 2

My goal is to take HTML entered by an end user, remove certain unsafe tags like <script>, and add it to the document. Does anybody know of a good JavaScript library for sanitizing HTML?

I searched around and found a few online, including John Resig's HTML parser, Erik Arvidsson's simple HTML parser, and Google's Caja Sanitizer, but I haven't been able to find much information about whether people have had good experiences using these libraries, and I'm worried that they aren't really robust enough to handle arbitrary HTML. Would I be better off just sending the HTML to my Java server for sanitization?

+2  A: 

You can parse HTML with jQuery, but I'm pretty sure any blacklist-based (i.e. filtering-out) approach to sanitizing is going to fail; you probably need a whitelist ("filtering in") approach instead, and ultimately you don't want to be relying on JavaScript for security anyway. In any case, for reference, you can use jQuery for DOM parsing like this:

var htmlS = "<html>etc.etc.";       // untrusted markup string
var $dom = $("<div>").html(htmlS);  // parse into a detached element
$dom.find("script").remove();       // removes nested <script> elements too
/* DON'T RELY ON THIS FOR SECURITY */
Graphain
Good point. In fact, you probably don't even *need* the jQuery wrapper, per se, but it would make things easier. Just let the browser itself handle the parsing, and then use the DOM methods available to you to do whatever you want.
Matchu
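For illustration, a minimal sketch of the plain-DOM approach Matchu describes; untrustedHtml is an assumed variable holding the user's markup, and (per the rest of this thread) removing script elements alone is nowhere near sufficient:

var container = document.createElement("div");
container.innerHTML = untrustedHtml; // the browser itself parses the markup
// getElementsByTagName returns a live list, so removing index 0 repeatedly drains it
var scripts = container.getElementsByTagName("script");
while (scripts.length > 0) {
    scripts[0].parentNode.removeChild(scripts[0]);
}
/* AGAIN: DON'T RELY ON THIS ALONE FOR SECURITY */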
Mind explaining how?
icktoofay
@icktoofay Yep, edited; my bad.
Graphain
Look at this web page for all the crazy ways you are vulnerable to XSS: http://ha.ckers.org/xss.html. Unfortunately, just removing the script tags is not even close to good enough...
gerdemb
@gerdemb - definitely, any HTML sanitization should be implemented as a whitelist instead of a blacklist.
Matchu
Simply parsing with jQuery or with an HTML parser doesn't even begin to address the complexity of filtering a document for untrusted code. You can't just remove script elements. See the XSS cheat sheet that gerdemb posted above. Just for example, consider: script elements, the onload attribute, the onclick attribute, on<whatever> attributes, meta elements, javascript: URLs, obfuscated javascript: URLs, object elements, applet elements, url() in CSS, and much, much more. The example in this answer is harmful in its inadequacy. Even a whitelist-based approach would have to filter URLs in elements like <a>.
thomasrutter
@thomasrutter absolutely agree
Graphain
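To make thomasrutter's point concrete, here is a hedged sketch of what a whitelist ("filter in") pass over a parsed tree might look like. The allowed tag and attribute sets and the URL check are illustrative assumptions, not a vetted policy, and a real sanitizer has to cover much more (CSS, encodings, embedded content):

var ALLOWED_TAGS  = { B: true, I: true, EM: true, STRONG: true, P: true, A: true };
var ALLOWED_ATTRS = { href: true, title: true };

function isSafeUrl(url) {
    // Allow only absolute http(s) URLs and page-relative links; this
    // rejects javascript:, data:, vbscript: and similar schemes.
    return /^https?:\/\//i.test(url) || /^[\/#]/.test(url);
}

function sanitizeNode(node) {
    // Copy childNodes first, because the loop mutates the tree
    var children = Array.prototype.slice.call(node.childNodes);
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 1) { // element node
            if (!ALLOWED_TAGS.hasOwnProperty(child.tagName)) {
                node.removeChild(child); // drop disallowed elements outright
                continue;
            }
            // Strip every attribute not explicitly allowed; this also
            // kills onclick, onload, on<whatever>, style, etc.
            var attrs = Array.prototype.slice.call(child.attributes);
            for (var j = 0; j < attrs.length; j++) {
                var name = attrs[j].name.toLowerCase();
                if (!ALLOWED_ATTRS.hasOwnProperty(name) ||
                        (name === "href" && !isSafeUrl(attrs[j].value))) {
                    child.removeAttribute(attrs[j].name);
                }
            }
            sanitizeNode(child); // recurse into kept elements
        } else if (child.nodeType !== 3) {
            node.removeChild(child); // drop comments and other non-text nodes
        }
    }
}

Note that dropping a disallowed element also drops its children; that's a deliberately blunt choice for this sketch (unwrapping the element and keeping its text is a friendlier alternative).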
+2  A: 

Would I be better off just sending the HTML to my Java server for sanitization?

Yes.

Filtering "unsafe" input must be done server-side. There is no other way to do it. It's not possible to do filtering client-side because the "client-side" could be a web browser or it could just as easily be a bot with a script.

thomasrutter
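For completeness, a sketch of the server-side shape of this answer, written with Node's http module only to stay in the thread's language; the OP's stack is Java, where a server-side sanitizer library would fill the sanitizeHtml role. The sanitizeHtml below is a deliberately crude placeholder that escapes everything rather than whitelisting:

var http = require("http");

// Placeholder sanitizer: escaping everything is the safe default.
// A real implementation whitelists tags, attributes and URL schemes.
function sanitizeHtml(html) {
    return html.replace(/&/g, "&amp;")
               .replace(/</g, "&lt;")
               .replace(/>/g, "&gt;");
}

http.createServer(function (req, res) {
    var body = "";
    req.on("data", function (chunk) { body += chunk; });
    req.on("end", function () {
        var clean = sanitizeHtml(body); // sanitize on arrival, before storing or echoing
        res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
        res.end(clean);
    });
}).listen(8080);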
Filtering unsafe *input*, yes, that must be done on the server, because a client that skips the filtering it's supposed to do can harm other users. This case, however, is filtering unsafe *output*, and a client that doesn't filter will only harm itself. Therefore, doing this with JavaScript is fine.
Bart van Heukelom
@Bart "a client that doesn't filter will only harm itself. Therefore, doing this with JavaScript is fine" <- this is not entirely true, as one compromised user might have access that affects other users
Graphain
A compromised user can do all sorts of bad things. If you filter out script tags on the server, the compromised client will just put them back when rendering. Or, more likely, it won't bother with that inconvenience and will just run the evil code directly.
Bart van Heukelom
@Bart van Heukelom your first comment above is true if the code never gets shared with other users or the server and is simply inserted into the current page using JavaScript, which, on re-reading the original question, I realise could be what the OP meant.
thomasrutter
It's even true if it's shared with others, as long as it's properly documented that it's unchecked data (but of course, *that* isn't always the case).
Bart van Heukelom
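To illustrate Bart's scenario, filtering untrusted markup client-side at output time, here is how the whitelist sketch from earlier in the thread might be applied just before insertion into the current page. "preview" is an assumed element id, and note that parsing via innerHTML can still trigger fetches (e.g. image loads) before the filter runs:

var root = document.createElement("div");
root.innerHTML = untrustedHtml;  // parse while detached from the document
sanitizeNode(root);              // whitelist-filter before attaching
document.getElementById("preview").appendChild(root);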