Hi guys, my social networking site is valid W3C XHTML. However, users can post blog reports and other content, and at times they enter ampersand characters, which break my validation. How can I fix this, and are there any other single characters I need to look out for that could break my validation?

+8  A: 

When displaying user-produced content, run it through the htmlspecialchars() function.
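
As a rough sketch of what that looks like in practice (the helper name escape_for_xhtml is made up for illustration, and UTF-8 is assumed as the page encoding), the characters that htmlspecialchars() converts are &, <, >, and the double quote, plus the single quote when ENT_QUOTES is passed:

```php
<?php
// Hypothetical helper; only htmlspecialchars() itself is part of PHP.
function escape_for_xhtml(string $text): string
{
    // ENT_QUOTES also converts single quotes.
    // 'UTF-8' is an assumption about the page encoding.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$post = 'Fish & chips <3 "quoted"';

// Prints: <p>Fish &amp; chips &lt;3 &quot;quoted&quot;</p>
echo '<p>' . escape_for_xhtml($post) . '</p>';
```

Escape at output time rather than when storing the post, so the database keeps the text exactly as the user typed it.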

Zed
More specifically, sanitize and validate *any* external inputs, be they from users or external systems. E.g., you may trust Google to do the right thing with OpenID, but do you trust all OpenID providers? Know what is valid for any given bit of input, and remove or reject anything outside this. Know what is valid for any display, and remove or reject anything outside this. That also goes for people inserting, for example, JavaScript into forms.
ptomli
The thing is that I'm allowing users to enter specific tags, like bold tags and quote tags.
Ali
If you allow HTML, you're going to get malformed markup from users, fact of life; the ampersands are only the tip of the iceberg. You could run the input markup through HTML Tidy to fix it up, but if you allow posting HTML you've also got a big security problem, as any user can inject scripting content into your site and attack any other user. If you must allow markup to be posted, look at HTML Purifier, but as hobbs says, if all you want is simple stuff like bold and italic, you will be *much* better off (for both security and usability for your users) with a simple markup language.
bobince
Jacco
+2  A: 

As a matter of general principle, it's a mistake to include user-submitted (or indeed any external) content in your page directly without validation or filtering. Besides causing validation errors, it can also cause "broken pages" and large security holes (cross-site scripting attacks).

Whenever you get data from anywhere that isn't 100% trusted, you need to make it safe in some way. You can do this by doing some or all of:

  1. Escaping textual data so that special characters are replaced by the HTML entities that represent them.
  2. Stripping or filtering unsafe HTML tags.
  3. Validating that HTML doesn't contain any unsafe or illegal constructs.

If your user input is meant to be interpreted as text, then you're mostly looking at option 1; if you're letting the users use HTML, then you're looking at options 2 and 3. A fourth option is to have the users use some more restrictive non-HTML markup such as Markdown or BBCode, translating between that markup and HTML using a library that (hopefully) doesn't allow the injection of security holes, page-breaking constructs, or other scary things.
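
To make the fourth option concrete, here is a minimal sketch of the idea, not a replacement for a real library. It assumes a tiny BBCode-like subset ([b], [i], [quote]) and UTF-8 input; the function name render_simple_markup and the mapping to strong, em, and q are choices made for this illustration. Everything is escaped first, and only matched, attribute-free tag pairs are then turned into markup:

```php
<?php
// Hypothetical function: converts a tiny BBCode-like subset to XHTML.
function render_simple_markup(string $input): string
{
    // Step 1: escape everything, so no user-supplied HTML survives.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Step 2: translate only whitelisted tags (mapping is an assumption).
    $map = ['b' => 'strong', 'i' => 'em', 'quote' => 'q'];

    // Repeat until nothing changes so that nested pairs are handled
    // from the inside out.
    do {
        $before = $html;
        foreach ($map as $code => $tag) {
            // The content may not contain brackets, so only properly
            // nested pairs match; anything else stays as literal text.
            $pattern = '#\[' . $code . '\]([^\[\]]*)\[/' . $code . '\]#';
            $html = preg_replace($pattern, '<' . $tag . '>$1</' . $tag . '>', $html);
        }
    } while ($html !== $before);

    return $html;
}

// Prints: <strong>R&amp;D</strong> &lt;script&gt;
echo render_simple_markup('[b]R&D[/b] <script>');
```

Because the user never writes angle brackets that reach the page, improperly nested input such as [b][i]x[/b][/i] is simply left as literal text instead of producing malformed markup.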

hobbs
+2  A: 

It's a bad idea to allow users to enter HTML markup.

This enables all kinds of nasty things, most notably cross-site scripting (XSS) exploits and injection of hidden spam (hidden from you, but not from search engine bots).

You should:

  • Neutralize all HTML tags by escaping them with htmlspecialchars(), and preserve only newlines with nl2br(). You might allow some formatting by implementing your own safe markup that allows only very specific tags (things like phpBB-style or wiki-like markup).

  • Use HTML Purifier to reliably eliminate all potentially dangerous markup. PHP's strip_tags() function is fundamentally broken and allows dangerous code in attributes if you use its whitelist (allowed tags) argument.
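
A small sketch of the difference, using a made-up input string: strip_tags() with an allowed-tags argument keeps the permitted tag together with whatever attributes the user put on it, while escaping and then calling nl2br() leaves no active markup at all.

```php
<?php
// Made-up example input containing an event-handler attribute.
$input = "<b onmouseover=\"alert(1)\">hi</b>\nsecond line";

// strip_tags() keeps the allowed <b> tag, attributes included,
// so the onmouseover handler survives.
$stripped = strip_tags($input, '<b>');

// Escape first, then convert newlines to <br /> for display.
// UTF-8 is an assumption about the page encoding.
$safe = nl2br(htmlspecialchars($input, ENT_QUOTES, 'UTF-8'));

echo $stripped, "\n";
echo $safe, "\n";
```

The order matters in the second case: calling nl2br() before htmlspecialchars() would escape the inserted br tags as well.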

porneL