I'm trying to sanitize and XSS-proof some HTML input from the client. I'm using Python 2.6 with Beautiful Soup. I parse the input, strip all tags and attributes not in a whitelist, and transform the tree back into a string.

However...

>>> unicode(BeautifulSoup('text < text'))
u'text < text'

That doesn't look like valid HTML to me. And with my tag stripper, it opens the way to all sorts of nastiness:

>>> print BeautifulSoup('<<script></script>script>alert("xss")<<script></script>script>').prettify()
<
<script>
</script>
script>alert("xss")<
<script>
</script>
script>

The <script></script> pairs will be removed, and what remains is not only an XSS attack but valid HTML as well.

The obvious solution is to replace every < character that, after parsing, turns out not to belong to a tag with &lt; (and similarly for >, &, " and '). But the Beautiful Soup documentation only mentions parsing entities, not producing them. Of course I can run a replace over all NavigableString nodes, but since I might miss something, I'd rather let some tried and tested code do the work.
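To be clear, the kind of replace-over-text-nodes approach I mean can be sketched with the standard library alone (shown here in current-Python form; the tag and attribute whitelists are purely illustrative, not a vetted policy):

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Illustrative whitelists -- not a vetted policy.
ALLOWED_TAGS = {'a', 'b', 'i', 'em', 'strong', 'p'}
ALLOWED_ATTRS = {'href', 'title'}

class WhitelistSanitizer(HTMLParser):
    """Drop non-whitelisted tags and attributes, re-escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            safe = ''.join(' %s=%s' % (k, quoteattr(v or ''))
                           for k, v in attrs if k in ALLOWED_ATTRS)
            self.out.append('<%s%s>' % (tag, safe))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Every text node is escaped on the way out, so a stray '<'
        # in the input can never re-enter the markup.
        self.out.append(escape(data))

def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

With this, sanitize('text < text') comes back as 'text &lt; text', and the nested-script trick above collapses into harmless escaped text instead of a reassembled tag. But again, I'd rather not hand-roll this myself.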

Why doesn't Beautiful Soup escape < (and other magic characters) by default, and how do I make it do that?


N.B. I've also looked at lxml.html.clean. It seems to work on the basis of blacklisting, not whitelisting, so it doesn't seem very safe to me. Tags can be whitelisted, but attributes cannot, and it allows too many attributes for my taste (e.g. tabindex). Also, it gives an AssertionError on the input <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>. Not good.

Suggestions for other ways to clean HTML are also very welcome. I'm hardly the only person in the world trying to do this, yet there seems to be no standard solution.

+2  A: 

The lxml.html.clean.Cleaner class does allow you to provide a tag whitelist with the allow_tags argument and to use the precomputed attribute whitelist from feedparser with the safe_attrs_only argument. And lxml definitely handles entities properly on serialization.

llasram
Like I wrote in the original question, it doesn't allow me to change the list of attributes. For example, `tabindex` can make the site behave in unexpected ways, and with some imagination, the various encoding attributes like `charset` could be used for malicious purposes as well, whereas they are rarely (if ever) useful. I'd rather allow only the attributes that are actually useful (and used).
Thomas
They're on the list of attributes accepted by feedparser, which is pretty paranoid. If you are more paranoid still, you can set `lxml.html.defs.safe_attrs` to just the attributes you consider safe. Alternatively, though it's not entirely off the shelf, you could reuse the `feedparser._HTMLSanitizer` class, modifying it to allow only the attributes you want.
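For concreteness, roughly like this (a sketch only: the whitelist values are illustrative, and newer lxml versions let you pass `safe_attrs` straight to `Cleaner` rather than patching `lxml.html.defs.safe_attrs`):

```python
from lxml.html.clean import Cleaner

# Illustrative whitelists; tune to taste. Note that
# remove_unknown_tags must be False whenever allow_tags is given.
cleaner = Cleaner(
    allow_tags=['a', 'b', 'i', 'em', 'strong', 'p'],
    remove_unknown_tags=False,
    safe_attrs_only=True,
    # Newer lxml accepts safe_attrs directly; on older versions,
    # patch lxml.html.defs.safe_attrs instead.
    safe_attrs=frozenset(['href', 'title']),
)

html = '<p tabindex="1"><a href="/x" onclick="alert(1)">ok</a></p>'
print(cleaner.clean_html(html))
```

Tags outside `allow_tags` are dropped (their text is kept), and with `safe_attrs_only=True` every attribute not in the whitelist, `tabindex` and `onclick` included, is stripped.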
llasram
I don't like that list. The `<form>` element is on it. Oh, and I just spotted a typo in `lxml/html/defs.py`: `marque` instead of `marquee`, still present in the svn version. I've reported a bug. Not a big deal, but it doesn't help build confidence...
Thomas