views:

762

answers:

3

I am working on finding a good way to make user submitted data, in this case allow HTML and have it be as safe and fast as I can.

I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer here. I do agree partially. htmlpurifier has the best open source code out there for filtering user submitted HTML but there solution is very bulky and is not good for performance on a high traffic site. I might even use there solution someday but for now my goal is to find a more lightweight method.

I have been using the 2 functions below for about 2 and a half years now with no problems yet but I think it is time to take some input from the pro's on here if they will help me.

The first function is called FilterHTML($string) it is ran before user data is saved to a mysql database. The second function is called format_db_value($text, $nl2br = false) and I use it on a page where I plan to show the user submitted data.

Below the 2 functions is a bunch of the XSS codes I found on http://ha.ckers.org/xss.html and I then ran them on these 2 functions to see how affective my code is, I am somewhat pleased with the results, they did block out every code I tried but I know it is still not 100% safe obviously.

Can you guys please look over it and give me any advice for my code itself or even on the whole html filtering concept.

I would like to do a whitelist approach someday but htmlpurifier is the only solution I have found worth using for that and as I mentioned it is not lightweight as I would like.

function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
     $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
     $oldstring = $string;
     //bgsound|
     $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

Below function is used when showing user submitted code on a webpage

function format_db_value($text, $nl2br = false) {
    if (is_array($text)) {
     $tmp_array = array();
     foreach ($text as $key => $value) {
      $tmp_array[$key] = format_db_value($value);
     }
     return $tmp_array;
    } else {
     $text = htmlspecialchars(stripslashes($text));
     if ($nl2br) {
      return nl2br($text);
     } else {
      return $text;
     }
    }
}


The codes below are from ha.ckers.org and they all seem to fail on my functions above

I did not try everyone on that site though there is a lot more, this is just some of them.
The original code is on the top line of each set and the code after running through my functions is on the line below it.

<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii

<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>

<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>

<IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
<IMG SRC=...alert('XSS')>

<IMG SRC=&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041>
<IMG SRC=F  MLEJNALN !>

<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
<IMG SRC=...alert('XSS')>


<IMG SRC="jav&#x0A;ascript:alert('XSS');">
<IMG SRC=...alert('XSS');">

perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out

<BODY onload!#$%&()*~+-_.,:;?@[/|\]^`=alert("XSS")>
...

<iframe src=http://ha.ckers.org/scriptlet.html <
...

<LAYER SRC="http://ha.ckers.org/scriptlet.html"&gt;&lt;/LAYER&gt;
......

<META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css&gt;; REL=stylesheet">
...; REL=stylesheet">

<IMG STYLE="xss:...(alert('XSS'))">
<IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">

<XSS STYLE="xss:...(alert('XSS'))">
<XSS STYLE="xss:expression(alert('XSS'))">

<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>

<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>


<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>

<IMG
SRC
=...
a
l
e
r
t
(
'
X
S
S
'
)
"
>
+3  A: 

Only way to be sure is to whitelist the tags and attributes that they can use and write strict regexps to validate allowed values of attributes. If you want to allow attributes such as "style" then you have additional complexity.

Blacklisting only might make attack for some people harder but it will not make it any harder for the person that uses technique you have not heard of yet.

I'd try using regexp to add missing closing tags to what users entered and replace <br> with <br /> and so on, then parse it using SimpleXML, then iterate over it and remove any tag that is not in whitelist, any attribute that is not in the whitelist for given tag, and any attribute that has a value that does conform to precise regexp for this attribute. After all I'd use asXML() to get the text back. I'd start with minimal set of tags and attributes and add new ones as needed being especially careful of anything that may contain url.

Kamil Szot
True I am looking for a good whitelist method available to the public beside html purifier, surely they have to exist, so many sites allow html
jasondavis
There is antisamy - http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project - but it has not good PHP implementation.
Anti Veeranna
+2  A: 

Here is four alternatives :

Boris Guéry
Thanks for the links, I have checked most of them out and most don't seem to do the job in today's code. Here is a link comparing them all http://htmlpurifier.org/comparison
jasondavis
I should add, HTML_Filter looks like something I could possibly use +1
jasondavis
A: 

IMHO htmlawed is the best -- lean, fast, full HTML coverage, most flexible... black OR white list for tags AND attributes. Safe? Defeats all the ha.ckers XSS codes

Ludovic