In many places across my sites, users can enter formatted text through a WYSIWYG editor or as plain text with tags. Naturally, such input is sanitized against security threats, but it is neither stripped of tags nor fully entity-encoded. Something like <p>hello world</p> goes back to the end user as <p>hello world</p>.

Most WYSIWYG editors are smart enough to clean up the markup before handing the content over to the form, but manual POST requests, non-WYSIWYG text areas, and users without JavaScript get no such cleanup. So nothing stops a user from entering an unclosed <a href="/">, which turns the rest of the page into a link.

What's the best way to treat this?

A: 

If the user is about to submit text with unclosed tags, it would be smart to show a warning message.

If the user wants to post a single void tag such as <br>, it can be left as <br> or written self-closed as <br />. It should not be given a separate end tag: </br> is not valid HTML.
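
A minimal sketch of how such a warning check might look, assuming Python and its standard-library html.parser (the thread itself names no tool). The names UnclosedTagFinder and find_unclosed_tags are hypothetical, and html.parser is only a tokenizer, not a full HTML5 parser, so this is an approximation: it tracks open tags on a stack and treats elements like <br> as needing no end tag.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML, so they are never "unclosed".
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: records tags that were opened but never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.unclosed = []    # tags skipped over by a later end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag; ignored in this sketch
        # Pop back to the matching start tag, noting anything left open inside it.
        while True:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unclosed_tags(fragment):
    """Return the names of tags in `fragment` that have no matching end tag."""
    finder = UnclosedTagFinder()
    finder.feed(fragment)
    finder.close()
    return finder.unclosed + finder.open_tags
```

A non-empty result from find_unclosed_tags could be used to trigger the warning before the form is accepted. Because the check has to run server-side to cover manual POST requests, it cannot rely on the editor in the browser.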

salchams
+2  A: 

Whatever the user supplies, parse it with an HTML parser. Sanitize it while it is a DOM, then serialize the DOM back to HTML, taking the contents of the body element (the parser will create one if necessary) as the string sent back to the end user. All elements that need closing tags will then have them in place.
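
As a rough illustration of the parse/serialize cycle, here is a sketch in Python using only the standard-library html.parser. This is an assumption-laden stand-in, not a recommendation: html.parser is a streaming tokenizer rather than a DOM builder, the names BalancingSanitizer and balance_html are hypothetical, and the tag and attribute whitelists are placeholders. It does not validate URL schemes in href, so it is not a substitute for a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Placeholder whitelists for illustration only; a real policy would differ.
VOID_ELEMENTS = {"br", "hr", "img"}
ALLOWED_TAGS = {"a", "b", "br", "em", "i", "li", "ol", "p", "strong", "ul"}
ALLOWED_ATTRS = {"a": {"href"}}


class BalancingSanitizer(HTMLParser):
    """Hypothetical helper: re-serializes a fragment with every tag closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # serialized output pieces
        self.stack = []   # currently open (and emitted) elements

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag; its text content is still emitted escaped
        allowed = ALLOWED_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in allowed and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag, innermost first.
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Close whatever the user left open at the end of the input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def balance_html(fragment):
    """Return `fragment` re-serialized so that every kept element is closed."""
    sanitizer = BalancingSanitizer()
    sanitizer.feed(fragment)
    sanitizer.close()
    return "".join(sanitizer.out)
```

With this sketch, balance_html('<a href="/">home') yields '<a href="/">home</a>', so the stray link can no longer swallow the rest of the page. The output can be stored once and served from cache afterwards.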

Alohci
It sounds reasonable, although I would prefer an approach that sanitizes it just once and caches the HTML. Either way, can you recommend a tool that does this for PHP and/or Rails?
Steven Xu
Not really. Neither PHP nor Rails is my speciality. There are lots of questions on SO about HTML parsers for various languages though, so I suggest you search around or ask a new question. While HTML sanitization is easier and more reliable when working on a DOM, if you've got a third-party sanitizer with a good reputation, it's not necessary to switch. I was just pointing out what a good place the DOM is to sanitize your HTML. Once the user's text has been through a parse/serialize cycle, you'd be free to cache the result.
Alohci