+2  Q: 

Unsafe Html

I'm building a simple web-based forum application. I want to allow users to include HTML in their posts, but would like to stop any cross-site scripting. My current strategy is to not allow any "script" tags, to only allow "style" and "href" attributes on any tag, and to not allow "href" values to start with "javascript:". Is there anything that I'm missing?

UPDATE: I ended up solving this with a "whitelist" of HTML elements. When invalid elements are found, I strip off the tag but leave the inner HTML. This solves the problem of people copying and pasting from an MS Word document. I also looked into antisamy.net but ran into some issues with how it handled style attributes on spans (it removes them). If I can get that worked out I may switch over to that solution.
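
For anyone who wants to see the shape of that approach, here is a rough sketch in Python using only the standard library. It is not the code I actually ended up with: the tag list, the names (ALLOWED_TAGS, DROP_WITH_CONTENT, UnwrappingSanitizer, sanitize) and the choice to drop the contents of script and style elements are all assumptions made for illustration. Disallowed tags are removed but their inner text is kept, and every attribute is dropped to keep the sketch short.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list; adjust to whatever the forum actually needs.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "span", "a"}
# Assumption: for these elements the contents are dropped too, not just the tags.
DROP_WITH_CONTENT = {"script", "style"}
# Allowed elements that never get a closing tag.
VOID_TAGS = {"br"}


class UnwrappingSanitizer(HTMLParser):
    """Keeps allow-listed tags, unwraps everything else, escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are deliberately discarded in this sketch.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it can never be read back as markup.
        if self.skip_depth == 0:
            self.out.append(escape(data))


def sanitize(html_text):
    parser = UnwrappingSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Word-style markup such as `<p class="MsoNormal">Hi <o:p>there</o:p></p>` loses the unknown `o:p` wrapper and the `class` attribute but keeps its text. A real implementation would also need to balance unclosed tags and decide which attributes to let through.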

+7  A: 

Make sure you take out iframe, object and embed as well. There are quite a lot of dangerous elements, actually.

Perhaps what would be better is to allow Markdown instead?

Jason Berry
Great point! +1 for you :)
herbrandson
I looked at using Markdown, but decided against it. I think a wysiwyg editor is more intuitive for non-technical users.
herbrandson
There's no contradiction between storing the text as Markdown and displaying it as WYSIWYG. In fact, if you make a variant that uses XML, you could create the WYSIWYG by pumping it through XSLT.
Steven Sudit
+10  A: 

You should follow the approach of StackOverflow and other sites, and use a whitelist for both tags and attributes. It sounds like you're using a whitelist for attributes, which is good. You should do so for elements as well so people don't sneak things in like form, iframe, meta, frameset, etc. (none of which you mentioned).
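
As a minimal sketch of the attribute side of such a whitelist (in Python, since the question does not name a language), something along these lines could work. The allow-lists and the names ALLOWED_ATTRS, URL_ATTRS, ALLOWED_SCHEMES, is_safe_url and filter_attrs are hypothetical, and the code assumes the attribute values have already been entity-decoded by the HTML parser. The style attribute is left out of the allow-list on purpose here, because it needs its own filtering.

```python
# Hypothetical per-tag attribute allow-list; anything not listed is dropped.
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
}
URL_ATTRS = {"href", "src"}
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(value):
    """Return True for relative URLs and URLs with an allow-listed scheme."""
    # Browsers tolerate whitespace and control characters inside a scheme,
    # so "java\tscript:" must be treated like "javascript:".
    cleaned = "".join(ch for ch in value if ch > " " and ch != "\x7f")
    # A colon only marks a scheme if it comes before any "/", "?" or "#".
    head = cleaned.split("/", 1)[0].split("?", 1)[0].split("#", 1)[0]
    if ":" not in head:
        return True  # no scheme at all: a relative URL
    scheme = head.split(":", 1)[0].lower()
    return scheme in ALLOWED_SCHEMES


def filter_attrs(tag, attrs):
    """Keep only allow-listed (name, value) pairs for the given tag."""
    allowed = ALLOWED_ATTRS.get(tag, set())
    kept = []
    for name, value in attrs:
        name = name.lower()
        if name not in allowed or value is None:
            continue  # drops onclick, onload, style and anything unknown
        if name in URL_ATTRS and not is_safe_url(value):
            continue  # drops javascript:, data:, vbscript: and similar
        kept.append((name, value))
    return kept
```

Checking the scheme against a short list of known-good values, rather than rejecting "javascript:" specifically, is the same whitelist idea applied to URLs.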

Matthew Flaschen
I actually started with a whitelist. The problem I ran into was dealing with text copied and pasted from MS Word. I'm not sure I can anticipate all the elements I might get in that case. The text goes into a WYSIWYG editor, so it's not clear to the user what HTML is actually being sent to the server. Also, the users aren't really techno-savvy, so they wouldn't know how to fix the issue if pasting from Word gave them an error message.
herbrandson
Pasting from MS Word into WYSIWYG editors is a pain! Some editors (FCKEditor and I think TinyMCE - my preferred) will allow you to intercept the Ctrl+V and paste functions of the browser and force the user to paste the MS Word content as plain text. It won't carry across formatting, but it'll be clean! TinyMCE actually has a "Paste from Word" feature too.
Jason Berry
A possible solution is to have your WYSIWYG control do sanitization on the client-side. That way, it can fix most of Word's notoriously bad HTML, and you can still have a server whitelist for security.
Matthew Flaschen
+2  A: 

I'd look at removing any onclick or really on[anything] attributes. It might be easier to build a list of what's allowed instead of a blacklist.

marcc
A: 

What do you say about:

style='background-image:url("my-site-which-inserts-something-that-will make-you-look-bad")'

And not entirely connected: make sure that if you allow people to upload files to the site (images/txt/whatever), they are served from a different domain name.

Itay Moav
True. And for that matter, <x style="express/**/ion:(alert(/bah!/))"> (taken from this post http://stackoverflow.com/questions/551480/writing-xss-filter-for-xhtml-based-on-white-list)
herbrandson
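Both examples above suggest that a style attribute needs a whitelist of its own. A minimal Python sketch of that idea follows; the property list, the value pattern and the names ALLOWED_PROPERTIES, SAFE_VALUE and clean_style are assumptions for illustration, not a complete CSS filter. It keeps a declaration only when the property is allow-listed and the value contains nothing but plain words, numbers, hex colours and percentages, so url(...), expression(...), comments and backslash escapes are all rejected.

```python
import re

# Hypothetical allow-list of CSS properties.
ALLOWED_PROPERTIES = {
    "color",
    "background-color",
    "font-weight",
    "font-style",
    "text-align",
    "text-decoration",
}
# Letters, digits, "#", "%", ".", ",", spaces and hyphens only.
# No parentheses, quotes, slashes or backslashes can get through.
SAFE_VALUE = re.compile(r"[a-zA-Z0-9#%., \-]+")


def clean_style(style):
    """Return the style string with only allow-listed declarations kept."""
    kept = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        prop = prop.strip().lower()
        value = value.strip()
        if prop in ALLOWED_PROPERTIES and SAFE_VALUE.fullmatch(value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

The pattern is strict enough to reject harmless values such as rgb(1, 2, 3) as well; that trade-off keeps the check simple.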
A: 

If you are using PHP, you can strip out everything but the elements you want to allow with strip_tags:

strip_tags(string,allow)

For example, this:

<?php
echo strip_tags("Hello <b><i>world!</i></b>","<b>");
?>

would output:

Hello <b>world!</b>

You should use this approach with:

mysql_real_escape_string();
htmlentities();
Joe
This is not adequate, because it does nothing for dangerous attributes (such as the `on`s mentioned by marcc)
Matthew Flaschen
The problem isn't really how to remove the tags, it's what tags should not be allowed.
herbrandson
also, I'm not using PHP :(
herbrandson
+1  A: 

A whitelist is the safest solution.

You mentioned pasting from Word in a comment. Don't count on knowing all of Word's HTML elements; it often comes back with crap like <o:p> for paragraphs (which generally only works as expected in Internet Explorer). You may be able to find most of these, but there could easily be some dangerous tags, perhaps an <o:script> tag or something.

By the way, there really aren't that many HTML tags. The W3.org index of elements will help you.

DisgruntledGoat