Hi guys,

One of the first things I learned as a web developer was to never ever accept any HTML from the client. (Perhaps only if I HTML encode it.)
I use a WYSIWYG editor (TinyMCE) that outputs HTML. So far I have only used it on an admin page, but now I'd like to also use it on a forum. It has a BBCode module, but that seems to be incomplete. (It is possible that BBCode itself doesn't support everything I want it to.)

So, here's my idea:

I allow the client to directly POST some HTML code. Then, I check the code for sanity (well-formedness) and remove all tags, attributes, and CSS rules that are not allowed based on a pre-defined set of allowed tags and styles.
Obviously I would allow whatever can be output by the subset of TinyMCE functionality I use.

I would allow the following tags:
span, sub, sup, a, p, ul, ol, li, img, strong, em, br

With the following attributes:
style (for everything), href and title (for a), alt and src (for img)

And the following CSS rules:
color, font, font-size, font-weight, font-style, text-decoration

These cover everything that I need for formatting, and (as far as I know) don't present any security risk. Basically, enforcing well-formedness and leaving out any layout-related styles prevents anyone from breaking the layout of the site. Disallowing the script tag and the like prevents XSS.
(One exception: maybe I should allow width/height in a predefined range for images.)
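
The CSS part of the idea above could be sketched roughly like this in plain Java (standard library only). The class name StyleFilter and the rule for what counts as a "safe" value are assumptions made for illustration, not part of TinyMCE or any sanitizer library. The sketch is deliberately stricter than it needs to be: it rejects anything containing parentheses, slashes, backslashes, or quotes, so rgb(...) colors and quoted font names are dropped too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of any library.
final class StyleFilter {
    // The CSS properties listed in the question.
    private static final Set<String> ALLOWED = Set.of(
            "color", "font", "font-size", "font-weight", "font-style", "text-decoration");

    // Returns the style attribute value with every declaration removed
    // that is not an allowed property or whose value fails the value check.
    static String clean(String style) {
        List<String> kept = new ArrayList<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "property: value" pair
            }
            String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (ALLOWED.contains(property) && isSafeValue(value)) {
                kept.add(property + ": " + value);
            }
        }
        return String.join("; ", kept);
    }

    // Assumption: only letters, digits, spaces and a few punctuation marks
    // are accepted. This rules out url(...), expression(...), CSS escapes
    // and comments, at the cost of rejecting some harmless values.
    private static boolean isSafeValue(String value) {
        return value.matches("[A-Za-z0-9 #%.,-]+");
    }
}
```

For example, StyleFilter.clean("color: red; position: absolute; font-size: 12px") keeps only the color and font-size declarations. The filter would run on the server, on each style attribute that survives the tag and attribute whitelist.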

Another advantage: this would save me from having to write or look for a BBCode-to-HTML converter.

What do you think?
Is this a secure thing to do?

(As I see, StackOverflow also allows some basic HTML in the "About Me" field, so I think I'm not the first one to implement this.)

EDIT:

I found this answer which explains how to do this fairly easily.
And of course, no one should think about using regex for this.

The question itself is not related to any language or technology, but if you are wondering, I write this application in ASP.NET.

+4  A: 

It's unclear which programming language you're using or prefer, but in Java there's Jsoup, a pretty slick HTML parser API which contains, among other things, an HTML cleaner based on a customizable whitelist of HTML tags and attributes (unfortunately no CSS rules, since that's completely out of the scope of an HTML parser). Here's a relevant extract from its site.

Sanitize untrusted HTML

Problem

You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.

Solution

Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.

String unsafe = 
      "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
      // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>

The Whitelist class itself contains several predefined whitelists which may be of use, like Whitelist#basic() and Whitelist#relaxed().

By the way, for .NET there's a Jsoup port named NSoup.

BalusC
@BalusC - This NSoup thingy has made it very simple! Thanks for the link! :)
Venemo
Interesting fact: this port was triggered by [one of my Jsoup answers](http://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program/2835555#2835555) :) (check the comments)
BalusC
@BalusC: very nice :) The only thing I miss from it is the cleanup of CSS rules. (Or is it there? I couldn't find it.)
Venemo
@BalusC - NSoup basically solved the problem for me, so I accepted your answer.
Venemo
You're welcome.
BalusC
+2  A: 

For PHP, check out HTML Purifier. It filters HTML using very advanced, customizable settings (allowed/disallowed tags, attributes, styles, etc.) and includes protection against XSS and tricky styles (e.g. display: none).

Also, TinyMCE does do a bit of filtering but since it's client-side you're not supposed to trust it anyway.

Andrew67
Well, the first-ever rule of web development is, never ever rely on client-side-only validation. :) Thanks for the PHP link, though I don't use PHP, I have a few friends who do. I can recommend it to them. :)
Venemo
+1  A: 

Of the tags you plan to allow, <a> definitely requires extra attention, due to the possibility of javascript: URLs. And of course you need to disallow javascript event handlers from all tags.
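
To illustrate the javascript: point, here is a minimal sketch of a scheme check in plain Java (standard library only). The class name UrlFilter and the set of allowed schemes are assumptions chosen for the example. It also assumes the input is the attribute value after an HTML parser has already decoded entities, and for simplicity it rejects relative URLs altogether.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of any library.
final class UrlFilter {
    // Assumption: only these schemes are wanted in user-supplied links.
    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https", "mailto");

    // Returns true only if the URL starts with an explicitly allowed scheme.
    static boolean isAllowed(String url) {
        // Browsers tolerate whitespace and control characters in and around
        // the scheme, so "java\tscript:" can still execute. Strip them first.
        String compact = url.replaceAll("[\\u0000-\\u0020]", "");
        int colon = compact.indexOf(':');
        if (colon <= 0) {
            return false; // no scheme: relative URLs are rejected in this sketch
        }
        String scheme = compact.substring(0, colon).toLowerCase(Locale.ROOT);
        return ALLOWED_SCHEMES.contains(scheme);
    }
}
```

With this, UrlFilter.isAllowed("http://example.com/") is true, while "javascript:alert(1)", "JaVaScRiPt:alert(1)" and "java\tscript:alert(1)" are all rejected. The important design choice is checking against a list of allowed schemes rather than trying to block known bad ones.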

Michael Borgwardt
@Michael, of course you are right. Your answer made me understand what the `AddProtocols` method in NSoup is for. Thanks! :)
Venemo
@Michael, if by the javascript event handlers you mean the `onload`, `onclick` etc., those will be erased since they are not among the allowed attributes. Is there anything else that needs to be taken care of against javascript?
Venemo
Also watch out for Opera, as it allows for `javascript:` in img tags' `src` attributes...
Ivo Wetzel
@Ivo - Thanks! :)
Venemo