tags:

views:

343

answers:

7

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.

Here is my current non-whitelist approach

function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
     $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
     $oldstring = $string;
     //bgsound|
     $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP? http://refactormycode.com/codes/333-sanitize-html

+5  A: 

Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result. It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.

For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!

Havenard
A: 

It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.

function sanitize($html) {
  $whitelist = array(
    'b', 'i', 'u', 'strong', 'em', 'a'
  );

  return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}

I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.

Jamie

Jamie Rumbelow
+3  A: 

HTML Purifier is the best HTML parser/cleaner out there.

MiffTheFox
+7  A: 

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

raspi
Using php, this is truly the way to go. It's output is amazing and safe.
DGM
I have seen this before I thought it was really bulky though, I will check it out again though, thanks
jasondavis
A: 

You can just use the strip_tags() function

Since the function is defined as

string strip_tags  ( string $str  [, string $allowable_tags  ] )

You can do this:

$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');

But take note that using strip_tags, you won't be able to filter off the attributes. e.g.

<a href="javascript:alert('haha caught cha!');">link</a>
thephpdeveloper
A: 

Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.

<?php

$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

function getHTMLCode($Node) {
    $Document = new DOMDocument();    
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;

    $TextName = $Node->tagName;
    if ($TextName == null)
        return $Text.$Node->textContent;

    if (in_array($TextName, $TagWhiteList)) 
        return $Text.getHTMLCode($Node);

    $Node = $Node->firstChild;
    if ($Node != null)
        $Text = getCleanHTML($Node, $Text);

    while($Node->nextSibling != null) {
        $Text = getCleanHTML($Node->nextSibling, $Text);
        $Node = $Node->nextSibling;
    }
    return $Text;
}

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";

?>

Hope this helps.

NawaMan
A: 

For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.

From the manual page:

Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.

Warning This function does not modify any attributes on the tags that you allow using allowable_tags , including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

You CANNOT rely on just this one solution.

iddqd