I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?

+17  A: 

Oh good lord you're screwed. Take a look at this

Basically, there are so many things you want to strip out. Plus, there's stuff that's valid but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that gets applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.

I would look into the requirements phase again.

  • Is a markdown-like alternative possible?
  • Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
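
To make the markdown-like option concrete, here is a minimal sketch in Python. The markup subset (`*emphasis*` and `**strong**`) and the `render` function are made up for illustration. The idea is to escape everything the user typed first, and only then generate the few tags you trust yourself.

```python
import html
import re

# Hypothetical mini-markup for illustration: **strong** and *emphasis* only.
STRONG = re.compile(r"\*\*(.+?)\*\*")
EMPHASIS = re.compile(r"\*(.+?)\*")


def render(text: str) -> str:
    """Escape all user input, then expand a tiny markup subset into tags."""
    # After this call the text contains no raw <, >, &, or quote characters.
    escaped = html.escape(text, quote=True)
    # Only tags generated here can ever reach the page.
    escaped = STRONG.sub(r"<strong>\1</strong>", escaped)
    escaped = EMPHASIS.sub(r"<em>\1</em>", escaped)
    return escaped


if __name__ == "__main__":
    print(render("<script>alert(1)</script> is *not* welcome"))
```

Since the user never writes tags directly, there is nothing to blacklist; anything that looks like HTML just shows up as literal text.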
Tom Ritter
OMFG, looks like I'm definitely screwed. Luckily, this is for an open-source project, but still... it looks like instead of having prohibited tags, I should have allowed tags instead.
Dmitri Nesteruk
Yes, a whitelist would be better than a blacklist, but even then you have to check for malformed HTML. I think the markdown type solution is a good one.
Ed Swangren
+1  A: 

There are plenty of ways that code could be sneaked in. Especially watch for situations like <img src="http://nasty/exploit/here.php"> that can feed a <script> tag to your clients. I've seen <script> blocked on sites before, but the <img> tag got right through, which resulted in 30-40 passwords stolen.

Sukasa
Where is this example of which you speak?
Crescent Fresh
It was on an older, now-defunct message board. Someone posted a thread in the most-viewed forum, and within a few minutes a lot of people had hit the thread and got their passwords stolen. The forum later shut down due to irreconcilable staff issues.
Sukasa
+1  A: 
Lucas Jones
Eh, even with whitelists you have to be careful, since I've seen one person manage to use img to sneak in javascript
Sukasa
+3  A: 

Instead of blacklisting some tags, it's always safer to whitelist. See what Stack Overflow does: What HTML tags are allowed on Stack Overflow?

There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
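
To illustrate why encoded javascript: URLs are a headache, here is a rough Python sketch of a scheme check. The function name `is_safe_url` and the list of allowed schemes are assumptions for the example, not anything Stack Overflow is known to use. It decodes character references and drops whitespace and control characters before looking at the scheme, since a naive string comparison misses both tricks.

```python
import html
import re

# Assumed allow-list of URL schemes for illustration.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(raw: str) -> bool:
    """Return True for relative URLs and URLs with an allow-listed scheme."""
    # Decode character references such as &#106; so they cannot hide a scheme.
    decoded = html.unescape(raw)
    # Drop whitespace and control characters, which can be used to split
    # a scheme name (for example a tab inside "javascript").
    cleaned = re.sub(r"[\x00-\x20\x7f]", "", decoded)
    # Only the part before the first /, ? or # can contain a scheme.
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme at all: treat as a relative URL
    scheme = head.split(":", 1)[0].lower()
    return scheme in ALLOWED_SCHEMES


if __name__ == "__main__":
    for url in ["https://example.com/a", "&#106;avascript:alert(1)", "img/a.png"]:
        print(url, is_safe_url(url))
```

Note that this only covers URL attributes. It says nothing about CSS or event-handler attributes, which need their own handling.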

waqas
A: 

I disagree with person-b. You're forgetting about javascript attributes, like this:

<img src="xyz.jpg" onload="javascript:alert('evil');"/>

Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
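
For what it's worth, here is a rough sketch of the whitelist approach using Python's standard `html.parser`. The allowed tags and attributes, the class name `AllowListSanitizer`, and the `sanitize` helper are all assumptions picked for the example. It rebuilds the output from scratch, so an attribute like onload never gets copied because it is simply not on the list.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list for illustration: tag -> attributes that may be kept.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}
URL_ATTRS = {"href", "src"}
VOID_TAGS = {"img"}  # tags that never get a closing tag
DROP_WITH_CONTENT = {"script", "style"}


class AllowListSanitizer(HTMLParser):
    """Re-emit only allow-listed tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tags are dropped, their text is kept
        parts = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops onload, onclick, style, anything unlisted
            value = value.strip()
            if name in URL_ATTRS and not value.lower().startswith(
                ("http://", "https://")
            ):
                continue  # keep only plain absolute http(s) URLs
            parts.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup: str) -> str:
    parser = AllowListSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Running `sanitize` on an img tag like the one above (with an absolute http URL in src) keeps the src and silently drops the onload. This sketch does not balance tags or fix malformed nesting, so a real deployment would probably be better served by a maintained sanitizer library.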

amdfan
Good point! Thanks :)
Lucas Jones
+5  A: 

You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.

Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.

Now you have to choose between your own markup language and HTML. I would choose a markup language. The pro is better security; the con is the inability to add unexpected kinds of web content, like YouTube videos. A good way to head off users' rage is to add an "HTML to my-site" feature that translates the corresponding HTML tags into your markup language and deletes all other tags.

The pros for HTML are consistency with standards, extensibility to new content types, and simplicity. The big con is the security risk of code injection. Should you pick HTML tags, try to adopt an existing, working system for filtering HTML (I think Drupal is doing quite a good job in this case).

Adam Matan
A: 

MediaWiki is more permissive than this site. It accepts colors (even white on white), margins, indents and absolute positioning (including values that would put the text completely off screen), null clippings and "display:none", font sizes (even ridiculously small or excessively large ones) and font names (even a legacy non-Unicode Symbol font name that will not render text successfully). This site, by contrast, strips out almost everything.

But MediaWiki successfully strips the dangerous active scripts out of CSS (i.e. the behaviors, the onEvent handlers, the active filters or javascript link targets) without completely filtering out the style attribute, and bans a few other active elements like object, embed and bgsound.
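
As a rough idea of what filtering a style attribute without removing it could look like, here is a Python sketch. This is not MediaWiki's actual code; the property list, the value pattern, and the name `filter_style` are assumptions for the example. It keeps a declaration only if the property is on an allow-list and the value uses a very conservative character set, which rules out url(...), expression(...), comments, and backslash escapes in one go.

```python
import re

# Assumed allow-list of CSS properties for illustration.
SAFE_PROPERTIES = {
    "color",
    "background-color",
    "font-size",
    "font-family",
    "margin",
    "text-indent",
}
# Conservative value pattern: letters, digits, #, %, dot, comma, space, hyphen.
# No parentheses, quotes, colons, slashes, or backslashes are accepted.
SAFE_VALUE = re.compile(r"^[A-Za-z0-9#%., \-]+$")


def filter_style(style: str) -> str:
    """Keep only allow-listed declarations with plainly harmless values."""
    kept = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        prop = prop.strip().lower()
        value = value.strip()
        if prop in SAFE_PROPERTIES and SAFE_VALUE.match(value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)


if __name__ == "__main__":
    print(filter_style("color: red; behavior: url(evil.htc); font-size: 12px"))
```

The price of being this strict is that legitimate values such as rgb(...) or quoted font names are rejected too; loosening the pattern safely takes a real CSS tokenizer.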

Both sites ban marquees as well (not standard HTML, and needlessly distracting).

But MediaWiki sites are patrolled by lots of users, and there are policy rules for banning users who abuse them repeatedly.

It supports animated images, and it supports active extensions, for example to render TeX math expressions, to run other approved extensions (like timeline), or to create or customize a few forms.

Verdy_p