I'm looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.

The main purpose is to filter out XSS and similar nasties that may be entered via web forms. A secondary purpose is to limit breakage of HTML content entered by non-technical users, e.g. via a WYSIWYG editor that has an HTML view.

I'm considering using HTML Purifier, or rolling my own by using an HTML DOM parser to go through a process like HTML(dirty)->DOM(dirty)->filter->DOM(clean)->HTML(clean).
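Roughly, the DIY version I have in mind would look something like the sketch below (PHP with DOMDocument; the whitelists are placeholders rather than a vetted policy, saveHTML($node) assumes PHP 5.3.6+, and whitelisted attribute values like href/src would still need their own checks):

<?php
// Sketch of the DOM-based approach: parse, walk, strip anything not whitelisted,
// then serialize back to HTML. The whitelists below are illustrative only,
// and character-encoding handling is omitted for brevity.
function filter_html($dirty) {
    $allowedTags  = array('p', 'b', 'i', 'em', 'strong', 'a', 'img', 'ul', 'ol', 'li');
    $allowedAttrs = array('a' => array('href'), 'img' => array('src', 'alt'));

    $doc = new DOMDocument();
    @$doc->loadHTML($dirty); // suppress warnings from broken markup; libxml repairs what it can

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*') as $node) {
        $tag = strtolower($node->nodeName);
        if ($tag === 'html' || $tag === 'body') {
            continue; // wrapper elements added by the parser
        }
        if (!in_array($tag, $allowedTags)) {
            $node->parentNode->removeChild($node); // drop the element and its children
            continue;
        }
        // Strip every attribute not explicitly whitelisted for this tag.
        $keep = isset($allowedAttrs[$tag]) ? $allowedAttrs[$tag] : array();
        for ($i = $node->attributes->length - 1; $i >= 0; $i--) {
            $attr = $node->attributes->item($i);
            if (!in_array(strtolower($attr->name), $keep)) {
                $node->removeAttribute($attr->name);
            }
        }
        // NOTE: surviving attributes like href/src still need value-level checks.
    }

    // Serialize only the contents of <body> back to an HTML fragment.
    $body  = $doc->getElementsByTagName('body')->item(0);
    $clean = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $clean .= $doc->saveHTML($child);
        }
    }
    return $clean;
}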

Can you describe successes with these or any easier strategies that are also effective? Any pitfalls to watch out for?

+4  A: 

User-submitted HTML isn't always valid, or indeed complete. Browsers will interpret a wide range of invalid HTML and you should make sure you can catch it.

Also be aware of the valid-looking:

<img src="http://www.mysite.com/logout" />

and

<a href="javascript:alert('xss hole');">click</a>
Ross
Thanks Ross, these are excellent examples of the kinds of input that should be filtered out. But the answer I'm looking for should also include methods and solutions.
Barry Austin
The first example (which is a reference to a codinghorror article: http://www.codinghorror.com/blog/archives/001171.html) is not really relevant since the 'hole' depends upon the nature of that URL, rather than the syntax of this particular HTML snippet.
Bobby Jack
There are still useful rules that could be applied to the first one, for example "allow <img> tag only when the src attribute matches the regex /^http:\/\/localsite.com\/uploaded_images\/[\w-]*\.(png|jpg|gif)$/i".
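As a rough illustration, a check like that could be a small helper along these lines (localsite.com/uploaded_images is of course a placeholder for your own upload location):

// Illustrative helper for the rule above.
function img_src_allowed($src) {
    return (bool) preg_match(
        '#^http://localsite\.com/uploaded_images/[\w-]*\.(png|jpg|gif)$#i',
        $src
    );
}
// During the DOM walk, drop any <img> node whose src fails img_src_allowed().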
Barry Austin
A: 

The W3C has a big open-source package for validating HTML available here:

http://validator.w3.org/

You can download the package for yourself and probably implement whatever they're doing. Unfortunately, a lot of DOM parsers are willing to bend the rules to allow for HTML code "in the wild", as it were, so it's a good idea to let the masters tell you what's wrong rather than leave it to a more practical tool; there are a lot of websites out there that aren't perfect, compliant HTML but that we still use every day.

Robert Elwell
Validation against DTD doesn't protect against XSS at all.
porneL
Exactly, I don't think that's what Barry meant by validation - think data validation or screening rather than standards validation. This would help against malformed HTML though ;)
Ross
+5  A: 

I've tested all exploits I know on HTML Purifier and it did very well. It filters not only HTML, but also CSS and URLs.

Once you narrow elements and attributes down to innocent ones, the pitfalls are in attribute content – javascript: pseudo-URLs (IE allows tab characters in the protocol name, so java&#09;script: still works) and CSS properties that trigger JS.
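A minimal sketch of that scheme check, assuming attribute values have already been entity-decoded by the HTML parser (the scheme whitelist is only an example):

// Strip control characters/whitespace that browsers ignore, then whitelist the scheme,
// so "java<TAB>script:" is caught along with plain "javascript:".
function url_scheme_allowed($url, $allowed = array('http', 'https', 'mailto')) {
    $normalized = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $normalized, $m)) {
        return true; // no scheme at all: a relative URL (but see the //evil.com caveat below)
    }
    return in_array(strtolower($m[1]), $allowed);
}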

Parsing of URLs may be tricky, e.g. these are valid: http://spoof.com:[email protected] or //evil.com. Internationalized domains (IDN) can be written in two ways – Unicode and punycode.
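For example, PHP's parse_url() separates the user-info trick from the real host, and a scheme-relative URL is worth special-casing because it has no scheme for a whitelist to see (its handling by parse_url() has also varied between PHP versions):

// The user-info trick: the real host here is evil.com, not spoof.com.
$parts = parse_url('http://spoof.com:[email protected]');
// scheme => http, user => spoof.com, pass => xxxxx, host => evil.com

// "//evil.com" has no scheme at all, so a scheme whitelist alone never rejects it;
// treat scheme-relative URLs explicitly rather than relying on the parser.
$parts = parse_url('//evil.com');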

Go with HTML Purifier – it has most of these worked out. If you just want to fix broken HTML, then use HTML Tidy (it's available as a PHP extension).
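A minimal usage sketch, assuming a recent HTML Purifier where set() takes the 'Namespace.Directive' form; the allowed elements and schemes are only examples, not a recommended policy:

require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist of elements/attributes; everything else is stripped.
$config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],img[src|alt],ul,ol,li');
// Restrict URI schemes so javascript: and friends never get through.
$config->set('URI.AllowedSchemes', array('http' => true, 'https' => true, 'mailto' => true));

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($dirty); // $dirty holds the submitted HTML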

porneL
... hint: http://htmlpurifier.org/
BlaM
Thanks for your answer!
Barry Austin
+1  A: 

I used HTML Purifier with success and haven't had any XSS or other unwanted input filter through. I also run the sanitized HTML through the Tidy extension to make sure it validates as well.
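For reference, that Tidy pass can be a short post-processing step along these lines (the configuration values are just an example):

// Repair/normalize the already-purified fragment with the Tidy extension.
$tidyConfig = array(
    'show-body-only' => true, // return just the fragment, not a full document
    'output-xhtml'   => true,
    'wrap'           => 0,
);
$tidy = new tidy();
$tidy->parseString($clean, $tidyConfig, 'utf8');
$tidy->cleanRepair();
$validated = tidy_get_output($tidy);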