I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP? http://refactormycode.com/codes/333-sanitize-html