Assuming this is going to be put into HTML content (such as between <body> and </body> or between <div> and </div>), you need to encode the 5 special XML characters (&, <, >, ", '), and OWASP recommends including slash (/) as well. The PHP builtin htmlentities() will do the first part for you, and a simple str_replace() can do the slash:
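function makeHTMLSafe($string) {
    // Encode & < > " ' (ENT_QUOTES covers both quote styles).
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Encode the slash as well, per the OWASP recommendation.
    $string = str_replace('/', '&#x2F;', $string);
    return $string;
}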
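To see what that kind of encoding produces, here is a minimal standalone sketch. The function is repeated so the snippet runs on its own, and the sample input is made up for illustration.

```php
<?php
// Repeated here so the snippet is self-contained.
function makeHTMLSafe($string) {
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = str_replace('/', '&#x2F;', $string);
    return $string;
}

// Hypothetical tainted input.
$tainted = '<script>alert("x")</script>';

// Prints: &lt;script&gt;alert(&quot;x&quot;)&lt;&#x2F;script&gt;
echo makeHTMLSafe($tainted), "\n";
```

The browser renders the result as literal text instead of parsing it as a script element.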
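If, however, you're going to be putting the tainted value into an HTML attribute, such as the href= clause of an <a, then you'll need to encode a wider set of characters: the ones that can break out of an unquoted attribute ([space] % * + , - / ; < = > ^ and |), plus the ampersand and both quote characters (& " '), since a quote would otherwise close the attribute. You must also double-quote your HTML attributes:
function makeHTMLAttributeSafe($string) {
    // Code points for " & ' followed by [space] % * + , - / ; < = > ^ |
    $scaryCharacters = array(34, 38, 39, 32, 37, 42, 43, 44, 45, 47, 59, 60, 61, 62, 94, 124);
    $translationTable = array();
    foreach ($scaryCharacters as $num) {
        // Build a two-digit hex entity, e.g. 32 becomes &#x20;
        $hex = str_pad(dechex($num), 2, '0', STR_PAD_LEFT);
        $translationTable[chr($num)] = '&#x' . $hex . ';';
    }
    // strtr() translates in a single pass, so the entities it emits are not re-encoded.
    $string = strtr($string, $translationTable);
    return $string;
}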
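As a rough usage sketch, the snippet below places an encoded value inside a double-quoted attribute. It repeats the function so it runs standalone, and it assumes the translation list also covers the double quote, ampersand, and single quote (code points 34, 38, 39), since a raw quote would close the attribute. The payload and the surrounding markup are made up for illustration.

```php
<?php
// Self-contained copy; 34, 38 and 39 (" & ') are included on the assumption
// that the value lands inside a quoted attribute.
function makeHTMLAttributeSafe($string) {
    $scaryCharacters = array(34, 38, 39, 32, 37, 42, 43, 44, 45, 47, 59, 60, 61, 62, 94, 124);
    $translationTable = array();
    foreach ($scaryCharacters as $num) {
        $hex = str_pad(dechex($num), 2, '0', STR_PAD_LEFT);
        $translationTable[chr($num)] = '&#x' . $hex . ';';
    }
    return strtr($string, $translationTable);
}

// Hypothetical payload that tries to close the attribute and add a handler.
$tainted = '" onmouseover="alert(1)';
$encoded = makeHTMLAttributeSafe($tainted);

// Prints: <input value="&#x22;&#x20;onmouseover&#x3d;&#x22;alert(1)">
echo '<input value="' . $encoded . '">', "\n";
```

Because the quotes, space, and equals sign are all turned into entities, the payload stays inside the value attribute.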
The final concern is ill-formed UTF-8: when delivered to some browsers, an ill-formed UTF-8 byte sequence can break out of an HTML entity. To protect against this, ensure that every string you receive is valid UTF-8:
function assertValidUTF8($string) {
    // An empty string is trivially valid; otherwise the /u match must succeed.
    if (strlen($string) && !preg_match('/^.{1}/us', $string)) {
        die;
    }
    return $string;
}
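The u modifier on that regular expression makes it a Unicode matching regex. With that modifier, the whole subject string is checked for valid UTF-8 before any matching is attempted, and preg_match() returns false if the check fails. So by matching just a single character (.), we're assured that the entire string is valid UTF-8.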
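A small sketch of that behaviour follows. The helper name isValidUTF8() is hypothetical; it performs the same check but returns a boolean instead of dying. Note that the bad byte in the second sample sits after the first character, yet the match still fails.

```php
<?php
// Hypothetical helper: same check, but reports the result instead of dying.
function isValidUTF8($string) {
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject contains an ill-formed UTF-8 sequence anywhere in the string.
    return $string === '' || preg_match('/^.{1}/us', $string) === 1;
}

$valid   = "caf\xC3\xA9";      // "café" encoded as well-formed UTF-8
$invalid = "caf\xE9 au lait";  // lone 0xE9 byte (Latin-1), ill-formed in UTF-8

var_dump(isValidUTF8($valid));    // bool(true)
var_dump(isValidUTF8($invalid));  // bool(false)

// The failed match is reported as a UTF-8 error rather than a plain non-match.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```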
Since this is all context-dependent, it's best to do any of this encoding at the latest possible moment, just before presenting output to the user. Making this a habit also makes it easy to spot any places you've missed.
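One way to picture why late encoding matters: a value that was encoded early and then encoded again at output time ends up double-encoded. The snippet below is only an illustration with made-up data; the variable names are hypothetical.

```php
<?php
// Hypothetical raw value, stored and passed around unencoded.
$comment = 'Tom & "Jerry" <3';

// Encode once, at the point of output, for the HTML content context.
$forHtml = htmlentities($comment, ENT_QUOTES, 'UTF-8');
// Prints: <p>Tom &amp; &quot;Jerry&quot; &lt;3</p>
echo '<p>' . $forHtml . '</p>', "\n";

// If the value had already been encoded earlier (say, before storage),
// encoding again at output time mangles it: &amp; becomes &amp;amp;
$twice = htmlentities($forHtml, ENT_QUOTES, 'UTF-8');
echo '<p>' . $twice . '</p>', "\n";
```

Keeping the raw value around and encoding only at the output site avoids this, and lets each output context apply its own encoder.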
OWASP provides a great deal of information on their XSS prevention cheat sheet.