Hello, I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use the existing high-quality filters written in PHP, because CppCMS is a high-performance framework written in C++.

The basic idea is to provide a filter that has a white list of HTML tags and a white list of attributes for those tags. For example, typical HTML input might consist of <b> and <i> tags and an <a> tag with href. But a straightforward implementation is not good enough, because even simple allowed links may include XSS:

<a href="javascript:alert('XSS')">Click On Me</a>

Many other examples can be found there. So I also thought about creating a white list of prefixes for attributes like href/src, so that I always check whether the value starts with (https?|ftp)://.
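A minimal sketch of what I have in mind (the tag, attribute and scheme lists below are only examples, not the final policy):

#include <map>
#include <set>
#include <string>
#include <vector>

// Allowed tags mapped to the attributes permitted on each of them.
static const std::map<std::string, std::set<std::string>> allowed_tags = {
    {"b", {}},
    {"i", {}},
    {"a", {"href"}},
};

// href/src values must start with one of these prefixes, which rejects
// javascript:, data:, vbscript: and similar schemes.
static const std::vector<std::string> allowed_url_prefixes = {
    "http://", "https://", "ftp://",
};

bool tag_allowed(const std::string &tag) {
    return allowed_tags.count(tag) != 0;
}

bool attribute_allowed(const std::string &tag, const std::string &attr) {
    auto it = allowed_tags.find(tag);
    return it != allowed_tags.end() && it->second.count(attr) != 0;
}

bool url_allowed(const std::string &value) {
    for (const std::string &prefix : allowed_url_prefixes) {
        if (value.compare(0, prefix.size(), prefix) == 0)
            return true;
    }
    return false;
}

(Tag and attribute names would be lowercased before these checks.)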

Questions:

  • Are these assumptions good enough for most purposes? That is, if I don't allow any style options and I check src/href against the white list of prefixes, does that solve the XSS problem? Are there problems that can't be fixed this way?
  • Is there a good reference for a formal grammar of HTML/XHTML, so I can write a simple parser that would clean up all incorrect or forbidden tags like <script>?
+3  A: 

You can take a look at the Anti Samy project, which is trying to accomplish the same thing. It's Java and .NET though.

Edit 1, a bit extra:

You can potentially come up with a very strict white list. It should be well structured, pretty tight and not very flexible. When you combine flexibility, lots of tags and attributes, and different browsers, you generally end up with an XSS vulnerability.

I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the img tag). Then you need to whitelist the attribute values, as you stated: http|https|ftp for URLs, or style="color|background-color", etc.

Consider this one:

<x style="express/**/ion:(alert(/bah!/))">
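Instead of looking for bad substrings like "expression" (which tricks like express/**/ion defeat), only accept declarations whose property name is whitelisted and whose value uses a very conservative character set. Rough sketch, with an example property list:

#include <cctype>
#include <set>
#include <sstream>
#include <string>

bool style_value_allowed(const std::string &style) {
    static const std::set<std::string> allowed_props = {"color", "background-color"};
    std::istringstream decls(style);
    std::string decl;
    while (std::getline(decls, decl, ';')) {
        std::string::size_type colon = decl.find(':');
        if (colon == std::string::npos)
            return false;
        std::string prop = decl.substr(0, colon);
        std::string value = decl.substr(colon + 1);
        // trim whitespace around the property name
        prop.erase(0, prop.find_first_not_of(" \t"));
        prop.erase(prop.find_last_not_of(" \t") + 1);
        if (allowed_props.count(prop) == 0)
            return false;
        // accept only a conservative character set in the value;
        // parentheses, slashes, quotes and the like are rejected
        for (std::string::size_type i = 0; i < value.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(value[i]);
            if (!std::isalnum(c) && c != '#' && c != ' ' && c != '-')
                return false;
        }
    }
    return true;
}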

You also need to think about character whitelisting or UTF-8 normalization, because different encodings can cause awkward issues, such as new lines in attributes or invalid UTF-8 sequences.
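For the encoding issue, a rough sketch of a UTF-8 validity check you could run before any other filtering (it rejects truncated sequences, overlong encodings and surrogates):

#include <string>

bool is_valid_utf8(const std::string &s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                     // stray continuation byte or 0xF8+
        if (i + len > s.size()) return false;  // truncated sequence
        for (size_t j = 1; j < len; ++j) {
            unsigned char cc = static_cast<unsigned char>(s[i + j]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // reject overlong encodings, surrogates and out-of-range code points
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;
        i += len;
    }
    return true;
}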

dr. evil
+1  A: 

All the details of HTML parsing are specified in HTML 5. Implementing it is quite a lot of work, however, and it doesn't matter much whether you parse HTML exactly, with all its corner cases. At worst you'll end up with a slightly different DOM, but you have to sanitize the DOM anyway.
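Rough sketch of that sanitization step, using a made-up node type (whatever parser you end up with will have its own); forbidden elements such as <script> are dropped with their whole subtree, and you'd filter attributes in the same walk:

#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string tag;   // empty for text nodes
    std::string text;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::vector<std::unique_ptr<Node>> children;
};

void sanitize(Node &parent, const std::set<std::string> &allowed) {
    std::vector<std::unique_ptr<Node>> kept;
    for (auto &child : parent.children) {
        if (!child->tag.empty() && allowed.count(child->tag) == 0)
            continue;               // drop forbidden element and its subtree
        sanitize(*child, allowed);  // recurse into what we keep
        // (attribute and URL-prefix filtering would also happen here)
        kept.push_back(std::move(child));
    }
    parent.children = std::move(kept);
}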

porneL
+1  A: 

As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically used for web development. Overall, it's going to depend on how complex an implementation you want to come up with.

A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into porting one of the established implementations to C++ rather than trying to write your own from scratch. There are so many tricks to worry about that I think you'd be better off standing on the shoulders of others who have already gone through all that.

I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ can't duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it would still be faster to do a conversion than a full design from scratch.

HTML Purifier seems like a strong PHP implementation that is still actively maintained. There's a comparison document where the author discusses some differences between his approach and others'; it's probably worth reading.

Whatever you come up with, definitely test it against all the examples you linked to, and make sure it passes them all. Good luck!

Chad Birch