Hi guys, my social networking site validates as W3C XHTML. However, users can post blog reports and other content, and they sometimes enter ampersand characters, which break my validation. How can I fix this, and are there any other single characters I need to look out for that could break my validation?
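When displaying user-produced content, run it through the htmlspecialchars() function. It converts &, <, > and double quotes (plus single quotes when ENT_QUOTES is set) into their HTML entities, which covers the characters that break validation.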
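To make that concrete, here is a minimal sketch of escaping on output. The helper name escape_for_html and the sample string are made up for illustration, and the 'UTF-8' argument assumes your pages are served as UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): escape text for HTML/XHTML output.
function escape_for_html(string $text): string
{
    // ENT_QUOTES converts both double and single quotes.
    // 'UTF-8' is an assumption about the page encoding.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Sample user input containing the characters that break validation.
$comment = 'Fish & chips <b>now</b> "half" price';
echo escape_for_html($comment), "\n";
```

Apply it once, at display time. Escaping a string that is already escaped turns &amp; into &amp;amp;, so it is usually simpler to store the raw text and escape only when printing.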
As a matter of general principle it's a mistake to include user-submitted (or indeed any external) content in your page directly without validation or filtering. Besides causing validation errors, it can also cause "broken pages" and large security holes (cross-site scripting attacks).
Whenever you get data from anywhere that isn't 100% trusted, you need to make it safe in some way. You can do this by doing some or all of:
1. Escaping textual data so that special characters are replaced by the HTML entities that represent them.
2. Stripping or filtering unsafe HTML tags.
3. Validating that HTML doesn't contain any unsafe or illegal constructs.
If your user input is meant to be interpreted as text then you're mostly looking at option 1; if you're letting the users use HTML then you're looking at options 2 and 3. A fourth option is to have the users use some more restrictive non-HTML markup such as Markdown or BBCode, translating between that markup and HTML using a library that (hopefully) doesn't allow the injection of security holes, page-breaking constructs, or other scary things.
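As a rough illustration of the fourth option, the sketch below escapes everything first and only then turns a tiny, made-up BBCode-like subset ([b] and [i]) back into markup. The function name render_bbcode_subset and the supported tags are assumptions for the example, not a real library; a maintained parser is the better choice in practice.

```php
<?php
// Hypothetical mini-renderer: escape first, then re-enable a fixed set of tags.
function render_bbcode_subset(string $raw): string
{
    // Step 1: neutralize all HTML the user typed (UTF-8 is assumed).
    $safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // Step 2: translate only the markup we explicitly allow.
    $safe = preg_replace('#\[b\](.*?)\[/b\]#s', '<strong>$1</strong>', $safe);
    $safe = preg_replace('#\[i\](.*?)\[/i\]#s', '<em>$1</em>', $safe);

    return $safe;
}

echo render_bbcode_subset('[b]Hello[/b] & <script>alert(1)</script>'), "\n";
```

Because the escaping happens before any tags are generated, the only markup that reaches the page is what the two replacements emit. Note that this naive version does not check nesting, so input like [b][i]x[/b][/i] would produce mis-nested tags and fail XHTML validation; handling that is one reason to prefer an existing library.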
It's a bad idea to allow users to enter HTML markup.
This enables all kinds of nasty things, most notably cross-site scripting (XSS) exploits and injection of hidden spam (hidden from you, not search engine bots).
You should:
- Neutralize all HTML tags using htmlspecialchars() and only preserve newlines with nl2br(). You might allow some formatting by implementing your own safe markup that allows only very specific tags (things like phpBB or Wiki-like markup).
- Use HTML Purifier to reliably eliminate all potentially dangerous markup. PHP's strip_tags() function is fundamentally broken: if you use its allowed-tags (whitelist) argument, it leaves the attributes of the allowed tags untouched, so dangerous code in attributes gets through.
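A short sketch of both points, with hypothetical names (render_plain_text, $post, $html, $kept) and an assumed UTF-8 encoding: the first part shows the escape-then-nl2br() order, the second shows why the allowed-tags argument of strip_tags() is not enough on its own.

```php
<?php
// Hypothetical helper: show user text as plain text, keeping only line breaks.
function render_plain_text(string $raw): string
{
    // Escape first, then add <br /> tags, so the breaks themselves are not escaped.
    return nl2br(htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'));
}

$post = "Line one & more\nLine <two>";
echo render_plain_text($post), "\n";

// strip_tags() keeps allowed tags as-is, including their attributes.
$html = '<b onclick="alert(1)">bold</b>';
$kept = strip_tags($html, '<b>');
echo $kept, "\n"; // the onclick attribute is still present
```

The order in render_plain_text matters: calling nl2br() before htmlspecialchars() would escape the inserted line-break tags and show them as literal text.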