This is related to http://stackoverflow.com/questions/3290766/htmlpurifier-adding-to-ignore-list. I have added a couple of tags to the whitelist, and this is my current code:

$config->set('HTML', 'AllowedElements', array("customreport", "column", "columnseq"));

$def = $config->getHTMLDefinition(true);
$def->addElement("customreport", 'Block', 'Flow', 'Common', array());
$def->addElement("column", 'Block', 'Inline', 'Common', array());
$def->addElement("columnseq", 'Inline', 'Empty', 'Common', array('path'=>'CDATA', 'label'=>'CDATA'));

The problem is that if I send an HTML tag whose attribute values are in single quotes, HTML Purifier changes them to double quotes. For example:

<columnseq path='test' label='tlabel' />

It happens even on the demo site (http://htmlpurifier.org/demo.php) with this test string:

<A HREF='http://www.google.com/'>XSS</A>

Can this behavior be overridden?

+2  A: 

The canonicalization of attribute quoting to double-quotes was an intentional design decision stemming from the fact that when we construct our in-memory representation of the HTML, we only have an associative array of attribute names to values, and no information about what the original quoting style was. If you use the DOM style parser, there is no way to get that information either.
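
To illustrate the point about the DOM, here is a minimal sketch using PHP's bundled DOMDocument class as a stand-in. It is not HTML Purifier's own code, and the variable names are made up for the example. It parses the same link written with single and with double quotes and collects each attribute as a name-to-value pair; nothing in the parsed result records which quote character was used, so both inputs end up indistinguishable.

```php
<?php
// Illustration only: uses PHP's bundled DOM extension, not HTML Purifier itself.
$inputs = array(
    "<a href='http://www.google.com/'>XSS</a>",  // single-quoted attribute
    '<a href="http://www.google.com/">XSS</a>',  // double-quoted attribute
);

$seen = array();
foreach ($inputs as $html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $link = $doc->getElementsByTagName('a')->item(0);

    // Collect the attributes as a name => value map.
    // The original quote character is not stored anywhere in the node.
    $attrs = array();
    foreach ($link->attributes as $attr) {
        $attrs[$attr->name] = $attr->value;
    }

    $seen[] = array(
        'attrs' => $attrs,
        'html'  => $doc->saveHTML($link),  // re-serialized markup
    );
}

// Compare what was recovered from the two differently quoted inputs.
var_dump($seen[0] === $seen[1]);
echo $seen[0]['html'], "\n";
```

Once the markup has been parsed, a serializer has to pick one quoting style for everything it writes back out, which is why the output is uniform regardless of the input.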

Edward Z. Yang
Hmm, the design decision makes sense to me, though it has created a peculiar problem on my end. Anyway, it only causes false positives, so I guess that is okay. Thanks for the answer.
pinaki
While HTML Purifier tries to be as syntax-preserving as is convenient, it doesn't really go beyond that. And you are probably going to get lots of false positives from users submitting HTML that is not well-formed. It is better to look for strings traditionally associated with XSS.
Edward Z. Yang
@Ambush Commander - What exactly do you mean by "strings traditionally associated with XSS"? Can you give an example?
pinaki
Also, is it possible to get the cleanup information from HTML Purifier? What I mean is, can HTML Purifier tell me when the input contains any XSS?
pinaki
Strings associated with XSS are basically any type of string that might be associated with JavaScript, like javascript or <script> or onSomeEvent. You can find a big catalog of these in a web intrusion detection system (try PHPIDS, maybe?). You can get cleanup information with the experimental Core.CollectErrors option (check the docs for more details), but a reported cleanup doesn't necessarily mean there was an XSS attempt.
Edward Z. Yang
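
As a rough sketch of the "look for strings associated with XSS" idea from the last comment, the function below checks input against three patterns: a script tag, a javascript: URI, and an on-event attribute. The function name findXssIndicators and the pattern list are assumptions made up for this example. They are not part of HTML Purifier or PHPIDS, and the list is far smaller than what a real intrusion detection ruleset would carry.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier or PHPIDS.
function findXssIndicators($input)
{
    // Small, deliberately incomplete list of JavaScript-related patterns.
    $patterns = array(
        'script tag'     => '/<\s*script\b/i',
        'javascript URI' => '/javascript\s*:/i',
        'event handler'  => '/\bon[a-z]+\s*=/i',
    );

    // Record the label of every pattern that matches the input.
    $hits = array();
    foreach ($patterns as $label => $regex) {
        if (preg_match($regex, $input)) {
            $hits[] = $label;
        }
    }
    return $hits;
}

// A link carrying a javascript: URI and an inline event handler.
print_r(findXssIndicators("<a href='javascript:alert(1)' onclick='x()'>hi</a>"));

// The custom tag from the question, which differs only in quoting style.
print_r(findXssIndicators("<columnseq path='test' label='tlabel' />"));
```

A check like this flags input by what it contains rather than by whether the purifier changed it, so a harmless change such as quote normalization does not register as a hit. It is only a detection heuristic and would not replace purifying the input.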