Never output any bit of data whatsoever to the HTML stream that has not been passed through htmlspecialchars()
and you're done. Simple rule, easy to follow, completely eradicates any XSS risk.
As a programmer it's your job to do it, though.
You can define
function h(s) { return htmlspecialchars(s); }
if htmlspecialchars()
is too long to write 100 times per PHP file. On the other hand, using htmlentities()
is not necessary at all.
The key point is: There is code, and there is data. If you intermix the two, bad things ensue.
In the case of HTML, code is elements, attribute names, entities, comments. Data is everything else. Data must be escaped to avoid being mistaken for code.
In case of URLs, code is the scheme, the host name, the path, the mechanism of the query string (?
, &
, =
, #
). Data is everything in the query string: parameter names and values. They must be escaped to avoid being mistaken for code.
URLs embedded in HTML must be doubly escaped (by URL-escaping and HTML-escaping) to ensure proper separation of code and data.
Modern browsers are capable of parsing amazingly broken and incorrect markup into something useful. This capability should not be stressed, though. The fact that something happens to work (like URLs in <a href>
without proper HTML-escaping applied) does not mean that it's good or correct to do it. XSS is a problem that roots in a) people unaware of data/code separation (i.e. "escaping") or those that are sloppy and b) people that try to be clever about what part of data they don't need to escape.
XSS is easy enough to avoid if you make sure you don't fall into categories a) and b).