views:

265

answers:

5

I'm working on StackQL.net, which is just a simple web site that allows you to run ad hoc T-SQL queries on the Stack Overflow public dataset. It's ugly (I'm not a graphic designer), but it works.

One of the choices I made is that I do not want to HTML-encode the entire contents of post bodies. This way, you see some of the formatting from the posts in your queries. It will even load images, and I'm okay with that.

But I am concerned that this will also leave <script> tags active. Someone could plant a malicious script in a Stack Overflow answer; they could even delete it immediately, so no one sees it. One of the most common queries people try when they first visit is a simple Select * from posts, so with a little bit of timing a script like this could end up running in several people's browsers. I want to make sure this isn't a concern before I update to the (hopefully soon-to-be-released) October data export.

What is the best, safest way to make sure just script tags end up encoded?

+2  A: 

Don't forget onclick, onmouseover, etc., or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">) or CSS (style="property: expression(evil!Evil!);") or…

There are a host of attack vectors beyond simple script elements.

Implement a white list, not a black list.
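A minimal sketch of what a whitelist could look like, using only Python's standard library. The tag and attribute lists here are assumptions chosen for illustration, not what Stack Overflow actually permits, and the class and function names are hypothetical. Anything not listed is escaped so it shows up as text; listed tags are re-emitted with only the listed attributes, and URL attributes must start with http:// or https://. It does not rebalance unclosed tags, so treat it as a starting point rather than a vetted sanitiser.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "code", "pre", "a", "img", "br"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
URL_ATTRS = {"href", "src"}
SAFE_SCHEMES = ("http://", "https://")
VOID_TAGS = {"br", "img"}


class AllowListSanitizer(HTMLParser):
    """Re-emits allowed markup; everything else becomes escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            # Show the disallowed tag (e.g. <script>) as literal text.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything else unlisted
            value = (value or "").strip()
            if name in URL_ATTRS and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript:, data:, and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            self.out.append(escape(f"</{tag}>"))
        elif tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text (including the body of a script element) is always escaped.
        self.out.append(escape(data))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<b onclick="evil()">hi</b>') keeps the bold formatting but loses the event handler, and an <img> whose src is a javascript: URL comes out with no src at all.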

David Dorward
A: 

What about simply breaking the <script> tags? Escaping only < and > for that tag, ending up with &lt;script&gt;, could be one simple and easy way.

Of course links are another vector. You should also disable every instance of href='javascript:', and every attribute starting with on*.

Just to be sure, nuke it from orbit.
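A rough sketch of the "escape only the script tag" idea, assuming a regex replacement is acceptable for that single tag (the function name is made up for illustration). It also shows why the other vectors still matter: an on* attribute on a harmless-looking tag passes through unchanged.

```python
import re

# Matches opening and closing script tags, in any letter case, with any attributes.
SCRIPT_TAG = re.compile(r"<(/?script\b[^>]*)>", re.IGNORECASE)


def escape_script_tags(html_text):
    """Turn <script ...> and </script> into visible text; leave other markup alone."""
    return SCRIPT_TAG.sub(r"&lt;\1&gt;", html_text)


# The script tags are neutralised while <b> keeps working...
escaped = escape_script_tags('<b>ok</b><SCRIPT src="x.js"></script>')

# ...but an event handler on another tag is not touched at all.
untouched = escape_script_tags('<img src="x" onerror="evil()">')
```

Here escaped is '<b>ok</b>&lt;SCRIPT src="x.js"&gt;&lt;/script&gt;', while untouched is identical to its input, handler included.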

voyager
Replacing < and > would break other formatting I want to allow.
Joel Coehoorn
@Joel Coehoorn: and replacing <script> with &lt;script&gt;? There is no way past that simple replacement. The tricky place will be the *other* vectors: the on* events.
voyager
Okay, I thought you meant _all_ < and >.
Joel Coehoorn
+1  A: 

If the messages are in XHTML format, then you could do an XSL transform and encode or strip the tags and attributes that you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a WYSIWYG editor that outputs XHTML.

John Cavan
Anything you can put here at Stack Overflow can end up in the data. I can't count on well-formed XML.
Joel Coehoorn
+3  A: 

You may want to modify the HTMLSanitize script to fit your purposes. It was written by Jeff Atwood to allow certain kinds of HTML to be shown. Since it was written for Stack Overflow, it'd fit your purpose as well.

I don't know whether it's 'up to date' with what Jeff currently has deployed, but it's a good starting point.

George Stocker
This is likely to end up as the accepted answer, but I won't get to try it out until this weekend.
Joel Coehoorn
@George: I did a rep recalc as you requested. Sorry about the significant loss. On the bright side, next time Jeff does a system-wide recalc you should barely notice it.
Bill the Lizard
Thanks for doing the recalc, Bill. There were a lot of migrated questions I answered, so I wasn't surprised it was a 600 point hit.
George Stocker
A: 

But I am concerned that this will also leave <script> tags active.

Oh, that's just the beginning of HTML ‘malicious content’ that can cause cross-site scripting. There's also event handlers; inline, embedded and linked CSS (expressions, behaviors, bindings), Flash and other embeddable plugins, iframes to exploit sites, javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL, meta-refresh, UTF-8 overlongs, UTF-7 mis-sniffing, data binding, VML and other non-HTML stuff, broken markup parsed as scripts by permissive browsers...

In short, any quick-fix attempt to sanitise HTML with a simple regex will fail badly.
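As one illustration of how such a quick fix goes wrong, here is a hedged sketch of a naive "strip script elements" regex (the pattern and function name are assumptions, not anyone's real code). A payload that nests one script element inside the broken halves of another is reassembled by the very act of stripping.

```python
import re

# A typical quick fix: delete anything that looks like a script element.
NAIVE_STRIP = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip_scripts(html_text):
    return NAIVE_STRIP.sub("", html_text)


# The inner <script></script> pairs are removed in a single pass,
# and the pieces left on either side join up into a working tag.
payload = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
result = naive_strip_scripts(payload)
```

After the call, result is exactly <script>alert(1)</script>, so the filter has produced the thing it was meant to remove. Running the regex in a loop closes this particular hole but does nothing about the other vectors listed above.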

Either escape everything so that any HTML is displayed as plain text, or use a full parser-and-whitelist-based sanitiser. (And keep it up-to-date, because even that's a hard job and there are often newly-discovered holes in them.)

But aren't you using the same Markdown system as SO itself to render posts? That would be the obvious thing to do. I can't guarantee there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past and there are probably some more obscure ones still in there as it's quite a complicated system). But at least you'd be no more insecure than SO is!

bobince
Yeah, I think I'm going with the Html Sanitize script already suggested by George.
Joel Coehoorn