I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.

Is there a nice library in Python that will clean out all the event-handler attributes, <script> elements, and other JavaScript cruft from HTML or a DOM tree?

I am intending to use Beautiful Soup to regularize the HTML, to make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all JavaScript.

If there is a nice library in some other language, that might also work, but I would really prefer Python.

I've done a bunch of Google searching and hunted around on PyPI, but haven't been able to find anything obvious.

A: 

You could use BeautifulSoup. It allows you to traverse the markup structure fairly easily, even if it's not well-formed. I don't know that there's something made to order that works only on script tags.

kprobst
I know about Beautiful Soup and was thinking of using it to check the well-formedness of the HTML and clean it up a bit. I was hoping, though, for something that would specifically remove all JavaScript.
Omnifarious
A: 

I would honestly look at using something like BBCode or some other alternative markup instead.

MrStatic
I absolutely detest those things when I encounter them. Every site seems to have its own weird variant markup language that isn't HTML, and I despise them all, especially since most of them never rationally considered how to escape things or how the various bits of markup combine. I don't want to add to the horror that already exists out there.
Omnifarious
+5  A: 

As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:

import BeautifulSoup  # BeautifulSoup 3; bs4 uses a different import and method names

soup = BeautifulSoup.BeautifulSoup(html)
# Remove every <script> element, along with its contents, from the tree.
for script_elt in soup.findAll('script'):
    script_elt.extract()
html = str(soup)
Ned Batchelder
What about all the event attributes?
Omnifarious
On second thought, since you are doing this to prevent security problems, you really do need a whitelist of allowed markup. There are just too many different ways to sneak bad content past blacklist filters: a javascript: URL in an href, for example, slips right past a filter that only removes <script> elements and on* attributes.
Ned Batchelder
+4  A: 

A whitelist approach to allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense
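
For illustration (this is not the recipe's code), a minimal sketch of the whitelist idea in the BeautifulSoup 3 style used elsewhere in this thread; the ALLOWED_TAGS and ALLOWED_ATTRS sets here are placeholders, not a vetted policy:

import BeautifulSoup

ALLOWED_TAGS = set(['a', 'b', 'i', 'em', 'strong', 'p', 'br'])  # illustrative only
ALLOWED_ATTRS = set(['href', 'title'])                          # illustrative only

def sanitize(html):
    soup = BeautifulSoup.BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name.lower() not in ALLOWED_TAGS:
            # Drop the element and everything inside it.
            tag.extract()
        else:
            # Strip every attribute that isn't whitelisted (onclick, style, ...).
            for attr, _ in tag.attrs[:]:
                if attr.lower() not in ALLOWED_ATTRS:
                    del tag[attr]
    return str(soup)

Even with a whitelist, attribute values such as href still need checking of their own: a javascript: URL is as dangerous as an onclick handler.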

What is wrong with existing markup languages such as the one used on this very site?

J.F. Sebastian
The problem with them is that almost all of them (except for the one used on this site) have strange special cases that the markup doesn't account for. For example, how do you bold and italicize something? Or what if you want something in a link quoted? What if you need to use one of the delimiter characters somewhere? It's ugly, ill-defined, and inflexible. The whitelist approach sounds like a plan, though.
Omnifarious
A: 

Eric,

Have you thought about using a 'SAX' type parser for the HTML? I'm really not sure, though, that it would ignore the event attributes properly. It would also be a bit harder to construct than using something like Beautiful Soup, and handling syntax errors may be a problem with SAX as well.
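
A minimal sketch of that event-driven idea using the standard library's HTMLParser (the module is named html.parser in Python 3): it re-emits the markup it is fed, dropping <script> elements and on* attributes, but it does no attribute-value escaping or URL checking, so it is only a starting point, not a complete defense:

from HTMLParser import HTMLParser  # html.parser in Python 3

class ScriptStripper(HTMLParser):
    # Re-emits markup, dropping <script> elements (and their contents)
    # and any on* event-handler attributes.

    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
        self.in_script = 0  # nesting depth of <script> elements

    def handle_starttag(self, tag, attrs):
        if tag.lower() == 'script':
            self.in_script += 1
        elif not self.in_script:
            kept = [(k, v or '') for k, v in attrs
                    if not k.lower().startswith('on')]
            # NOTE: a real sanitizer must escape these attribute values.
            self.out.append('<%s%s>' % (tag,
                ''.join(' %s="%s"' % kv for kv in kept)))

    def handle_endtag(self, tag):
        if tag.lower() == 'script':
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append('&#%s;' % name)

stripper = ScriptStripper()
stripper.feed(html)
cleaned = ''.join(stripper.out)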

What I like to do in situations like this is to construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
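
A rough sketch of that approach, with a hypothetical Element class standing in for the XML_Element subclasses described above (a real version would also have to escape text and attribute values when serializing):

class Element(object):
    # Hypothetical stand-in for the XML_Element subclasses described above.
    def __init__(self, name, attrs=None, children=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children or [])  # Element nodes or text strings

    def prune(self, unwanted=('script',)):
        # Drop event-handler attributes, then unwanted child elements, recursively.
        self.attrs = dict((k, v) for k, v in self.attrs.items()
                          if not k.lower().startswith('on'))
        self.children = [c for c in self.children
                         if not (isinstance(c, Element) and c.name in unwanted)]
        for c in self.children:
            if isinstance(c, Element):
                c.prune(unwanted)

    def serialize(self):
        attrs = ''.join(' %s="%s"' % kv for kv in self.attrs.items())
        inner = ''.join(c.serialize() if isinstance(c, Element) else c
                        for c in self.children)
        return '<%s%s>%s</%s>' % (self.name, attrs, inner, self.name)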

Regards,

RL Drenth