advantages from htmlpurifier instead of regex filtering

+1 Q:

We recently implemented HTML Purifier in our web-based application. Previously we used regexes to match commonly known XSS injections (script, img, etc.). We realized that this wasn't good enough and moved to HTML Purifier. Given that HTML Purifier is slow (very slow compared to the regex method we had earlier), is it really worth keeping? Or does it make sense to keep extending the regex filtering until we reach a satisfactory level (it might be argued that the speed benefit would be nullified by then)? Has anyone else faced similar security issues in their web application, and what did you do in the end?

Please let me know if anything seems vague; I would be happy to provide more details.

+1 A:

It's better to be safe than sorry. There's a whole slew of attacks your regular expressions might not find. For example, here are just a few. If HTML Purifier is too slow, see if caching the purified HTML helps.

icktoofay 2010-08-05 04:58:12

Thanks for the answer. I am already caching the purified HTML, but even with caching there is a difference of about 0.5 seconds in page load time before and after HTML Purifier.

pinaki 2010-08-05 05:07:37

@pinaki: I don't know exactly how you're caching it, but if you cache it in a column in the same table as the unpurified HTML, (I assume you're using an RDBMS) then it should be just as quick as the regex approach, if not quicker.

icktoofay 2010-08-05 05:12:33

Try caching the HTML on write, not on read. Then it should make no difference to the speed of the site.

TRiG 2010-09-06 16:17:14
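The "cache on write" suggestion in the comments above can be sketched roughly as follows. This is a minimal illustration in Python with SQLite rather than the PHP stack discussed in the thread; the `posts` table, its column names, and the `purify` function are all hypothetical, and `purify` is only a placeholder for an expensive sanitizer such as HTML Purifier.

```python
import html
import sqlite3


def purify(raw_html: str) -> str:
    """Hypothetical stand-in for an expensive sanitizer.

    It simply escapes everything, which is NOT what a real purifier does;
    it only marks the place where the slow call would happen.
    """
    return html.escape(raw_html, quote=True)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, "
    "body_raw TEXT NOT NULL, "    # what the user submitted
    "body_clean TEXT NOT NULL)"   # sanitized copy, computed once
)


def save_post(raw_html: str) -> int:
    # The slow sanitizing step runs here, once per write.
    clean = purify(raw_html)
    cur = conn.execute(
        "INSERT INTO posts (body_raw, body_clean) VALUES (?, ?)",
        (raw_html, clean),
    )
    conn.commit()
    return cur.lastrowid


def render_post(post_id: int) -> str:
    # Reads only fetch the stored clean copy; no sanitizing per page view.
    row = conn.execute(
        "SELECT body_clean FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0]
```

With this layout the per-request cost is a single column read, so page load time should not depend on how slow the sanitizer is. Keeping the raw column lets the clean copy be regenerated if the sanitizer is later updated.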

+1 A:

The problem with regexes is that filtering HTML is too complex a task to be able to do easily, or elegantly, with regexes without creating a big mess.

You need to build something that actually understands HTML and can operate on it as HTML, and knows how a browser is going to interpret something. Regexes operate on it as if it's just one big long string. They're not good or elegant at parsing HTML in a stateful manner, for example recognising that a current match is within a comment, or within an attribute, or within an element etc. It's just really complicated to emulate that in regexes.

The other issue is that 'matching commonly known XSS injections' is way more complex than it sounds. If it isn't, you're not doing it right. Your filter needs to know HTML, it needs to know what a valid URL scheme is and how null bytes work in different parts of HTML etc. Basically, most of the injections on the XSS cheat sheet, for example, are based on getting around filtering done by regex-based filters.

One more thing is that HTML Purifier is maintained by someone who knows what they're doing. You can trust it, and you can trust that if a new flaw is found in it, it'll be patched. That can save you a lot of work trying to do the same thing on your own and keeping up to date with all of the different patches out there.

thomasrutter 2010-08-05 04:58:54
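To make the point about regex-based filters concrete, here is a small sketch of a naive filter and two well-known ways around it. The pattern and the `naive_filter` name are made up for illustration (in Python rather than PHP) and are not taken from the asker's application.

```python
import re

# Hypothetical "strip script tags" rule of the kind a regex filter might use.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def naive_filter(text: str) -> str:
    # Single pass: delete anything that looks like a <script> or </script> tag.
    return SCRIPT_TAG.sub("", text)


# Bypass 1: nest a tag inside a broken tag. Removing the inner tags
# splices the outer fragments back into a working script element.
nested = "<scr<script>ipt>alert(1)</scr</script>ipt>"
print(naive_filter(nested))  # <script>alert(1)</script>

# Bypass 2: no script tag at all, just an event handler attribute.
handler = "<img src=x onerror=alert(1)>"
print(naive_filter(handler))  # <img src=x onerror=alert(1)>
```

Each bypass can be patched with another rule, but every new rule is another string-level guess about how a browser will parse the input, which is the maintenance burden the answer above describes.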

Agreed. Updating is one of the best reasons even I can think of :)

pinaki 2010-08-05 05:05:40

+2 A:

Using a regex for HTML/JavaScript? Perhaps you have not seen this epic answer by bobince. In short, if you use a regex then you have two problems. In fact, the reason HTML Purifier is so slow is that it uses hundreds of calls to preg_match() and preg_replace() to clean a message. You should never reinvent the wheel; your own version will without a doubt be less secure.

The real question is htmlspecialchars($var, ENT_QUOTES); vs. HTML Purifier. HTML Purifier is not only slow, it has been hacked many times. Don't use HTML Purifier unless there is no other choice; htmlspecialchars solves most problems, and it solves them in a way that cannot be bypassed.

Rook 2010-08-05 05:18:19
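The escaping approach from the answer above can be illustrated with a short sketch. It uses Python's `html.escape` as a rough analogue of PHP's `htmlspecialchars($var, ENT_QUOTES)`; the two are assumed to be comparable here only in that both escape `&`, `<`, `>` and both kinds of quote, and the `escape_for_html` wrapper name is made up.

```python
import html


def escape_for_html(value: str) -> str:
    # quote=True also escapes " and ', similar in spirit to ENT_QUOTES.
    return html.escape(value, quote=True)


payload = '<img src=x onerror="alert(1)">'
print(escape_for_html(payload))
# &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```

Because every markup character becomes an entity, the browser shows the payload as literal text instead of parsing it. The trade-off is that legitimate formatting is escaped as well, so this fits only where users are not expected to submit HTML, and only when the value is placed in element content or a quoted attribute.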

Hmmm, great links, lots to think about. I guess yours is the best answer around. Thanks.

pinaki 2010-08-05 07:38:16

@pinaki you're welcome.

Rook 2010-08-05 07:44:48
