ansaurus

Question

What is the best way to handle user generated html content that will be viewed by the public?

Answer 1

+1 A:

That's an entirely reasonable approach. For typical applications it will be entirely sufficient.

The trickiest part of white-listing raw HTML is the style attribute and embed/object. There are legitimate reasons why someone might want to put CSS styles into an otherwise untrusted block of formatted text, or, say, an embedded YouTube video. This issue comes up most commonly with feeds. You can't trust the arbitrary block of text contained within a feed entry, but you don't want to strip out, e.g., syntax-highlighting CSS or Flash video, because that would fundamentally change the content and potentially confuse anyone reading it. Because CSS can contain dangerous things like behaviors in IE, you may have to parse the CSS if you decide to let the style attribute stay in. And with embed/object you may need to white-list hostnames.
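To make the last two points concrete, here is a minimal sketch in plain JavaScript of a hostname white-list for embed/object sources and a property white-list for the style attribute. The function names and the contents of both lists are assumptions made up for illustration; a real application would pick its own, and would probably want a proper CSS parser rather than splitting on semicolons.

```javascript
// Hypothetical white-lists; adjust to what your application actually needs.
const ALLOWED_EMBED_HOSTS = new Set(['www.youtube.com', 'youtube.com']);
const ALLOWED_STYLE_PROPS = new Set(['color', 'background-color', 'font-weight', 'font-style']);

// Returns true only for http(s) URLs whose exact hostname is on the white-list.
function isAllowedEmbed(src) {
  let url;
  try {
    url = new URL(src);
  } catch (e) {
    return false; // relative or unparseable URL: reject
  }
  const isHttp = url.protocol === 'http:' || url.protocol === 'https:';
  return isHttp && ALLOWED_EMBED_HOSTS.has(url.hostname);
}

// Keeps only white-listed "property: value" declarations with plain values.
function filterStyle(style) {
  return style
    .split(';')
    .map(function (decl) {
      const idx = decl.indexOf(':');
      if (idx === -1) return null;
      const prop = decl.slice(0, idx).trim().toLowerCase();
      const value = decl.slice(idx + 1).trim();
      // Allow only letters, digits, '#', '%', '.', '-' and spaces, which
      // rules out url(...), expression(...) and backslash escapes.
      const isPlainValue = /^[a-zA-Z0-9#%.\- ]+$/.test(value);
      return ALLOWED_STYLE_PROPS.has(prop) && isPlainValue ? prop + ': ' + value : null;
    })
    .filter(function (decl) { return decl !== null; })
    .join('; ');
}
```

With these lists, filterStyle('color: red; behavior: url(x.htc); font-weight: bold') keeps only the color and font-weight declarations, and isAllowedEmbed rejects anything that is not an http(s) URL on one of the listed hosts. The value check is deliberately strict, so it will also drop some harmless styles.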

Addenda:

In the worst case, HTML-escaping everything in sight can lead to a very poor user experience. It's much better to use something like one of the HTML5 parsers to walk the DOM with your whitelist. This is much more flexible in terms of how you present the sanitized output to your users. You can even do things like:

<div class="sanitized">
  <div class="notice">
    This was sanitized for security reasons.
  </div>
  <div class="raw"><pre>
    &lt;script&gt;alert("XSS!");&lt;/script&gt;
  </pre></div>
</div>

Then hide the .raw stuff with CSS, and use jQuery to bind a click handler to the .sanitized div that toggles between .raw and .notice:

CSS:

.raw {
  display: none;
}

jQuery:

$('.sanitized').click(function() {
  // Swap which of the two children is visible: the notice or the raw markup.
  $(this).find('.notice').toggle();
  $(this).find('.raw').toggle();
});

Bob Aman 2009-10-22 17:46:07

I haven't yet allowed CSS styles to be used as content, but I want to allow video soon. Figured that was a question of its own.

Aaron 2009-10-22 18:02:58

It is. In most cases, I'd recommend going the Facebook route. Treat videos like attachments, rather than having them as part of the content.

Bob Aman 2009-10-22 18:43:30

Oh, excellent idea!

Aaron 2009-10-30 18:05:49

Answer 2

+1 A:

The white list is a good move. Any black list solution is prone to letting through more than it should, because you just can't think of everything. I've seen some attempts at using black lists (for example, The Code Project), and even if they manage to catch everything, they generally still cause additional problems, like replacing characters in code so that it can't be used without manually restoring it first.

The safest method would be:

HTML encode all the text.
Match a set of allowed tags and attributes and decode those.

Using a regular expression you can even require that each opening tag has a closing tag, so that an unclosed tag can't mess up the page.
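As a rough illustration of the two steps and the paired-tag requirement, here is a minimal sketch in plain JavaScript. The names escapeHtml, sanitize and ALLOWED_TAGS are made up for this example, and the tag list is an assumption. It only restores attribute-free tags; allowing attributes would need an additional, equally strict pattern per attribute. It also does not guarantee that the restored tags are properly nested.

```javascript
// Hypothetical white-list of attribute-free tags.
const ALLOWED_TAGS = ['b', 'i', 'em', 'strong', 'p', 'code', 'pre'];

// Step 1: HTML-encode every special character.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Step 2: decode only matched open/close pairs of white-listed tags.
function sanitize(input) {
  const pair = new RegExp(
    '&lt;(' + ALLOWED_TAGS.join('|') + ')&gt;([\\s\\S]*?)&lt;/\\1&gt;', 'gi');
  let output = escapeHtml(input);
  let previous;
  // Repeat until nothing changes so that pairs nested inside other
  // pairs are decoded as well.
  do {
    previous = output;
    output = output.replace(pair, '<$1>$2</$1>');
  } while (output !== previous);
  return output;
}
```

For example, sanitize('<b>bold</b> <script>alert("XSS!");</script>') restores the b element and leaves the script tags encoded, while an unclosed tag such as '<b>oops' stays fully encoded because it never matches the pair pattern.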

You should be able to do this in something like ten lines of code, so the code that you linked to seems overly complicated.

Guffa 2009-10-22 17:49:43
