I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.

Is there a nice library in Python that will clean out all the event-handler attributes, <script> elements, and other JavaScript cruft from HTML or a DOM tree?

I am intending to use Beautiful Soup to regularize the HTML, to make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all JavaScript.

If there is a nice library in some other language, that might also work, but I would really prefer Python.

I've done a bunch of Google searching and hunted around on PyPI, but haven't been able to find anything obvious.

A: 

You could use BeautifulSoup. It allows you to traverse the markup structure fairly easily, even if it's not well-formed. I don't know that there's something made to order that works only on script tags.

kprobst
I know about Beautiful Soup and was thinking of using it to check the well-formedness of the HTML and clean it up a bit. I was hoping, though, for something that would specifically remove all JavaScript.
Omnifarious
A: 

I would honestly look at using something like BBCode or some other alternative markup instead.

MrStatic
I absolutely detest those things when I encounter them. Every site seems to have its own weird variant markup language that isn't HTML, and I despise them all, especially since most of them never rationally considered how to escape things or how the various bits of markup combine. I don't want to add to the horror that already exists out there.
Omnifarious
+5  A: 

As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:

import BeautifulSoup  # BeautifulSoup 3; bs4 uses a different import and method names

soup = BeautifulSoup.BeautifulSoup(html)
# Remove every <script> element, along with its contents, from the tree.
for script_elt in soup.findAll('script'):
    script_elt.extract()
html = str(soup)
Ned Batchelder
What about all the event attributes?
Omnifarious
On second thought, since you are doing this to prevent security problems, you really do need a whitelist of allowed markup. There are just too many different ways to sneak bad content past blacklist filters: a javascript: URL in an href, for example, slips right past a filter that only removes <script> elements and on* attributes.
Ned Batchelder
+4  A: 

A whitelist approach to allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense
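
For illustration (this is not the recipe's code), a minimal sketch of the whitelist idea in the BeautifulSoup 3 style used elsewhere in this thread; the ALLOWED_TAGS and ALLOWED_ATTRS sets here are placeholders, not a vetted policy:

import BeautifulSoup

ALLOWED_TAGS = set(['a', 'b', 'i', 'em', 'strong', 'p', 'br'])  # illustrative only
ALLOWED_ATTRS = set(['href', 'title'])                          # illustrative only

def sanitize(html):
    soup = BeautifulSoup.BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name.lower() not in ALLOWED_TAGS:
            # Drop the element and everything inside it.
            tag.extract()
        else:
            # Strip every attribute that isn't whitelisted (onclick, style, ...).
            for attr, _ in tag.attrs[:]:
                if attr.lower() not in ALLOWED_ATTRS:
                    del tag[attr]
    return str(soup)

Even with a whitelist, attribute values such as href still need checking of their own: a javascript: URL is as dangerous as an onclick handler.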

What is wrong with existing markup languages such as the one used on this very site?

J.F. Sebastian
The problem with them is that almost all of them (except for the one used on this site) have strange special cases that the markup doesn't account for. For example, how do you bold and italicize something? Or what if you want something in a link quoted? What if you need to use one of the delimiter characters somewhere? It's ugly, ill-defined, and inflexible. The whitelist approach sounds like a plan, though.
Omnifarious
A: 

Eric,

Have you thought about using a 'SAX' type parser for the HTML? I'm really not sure, though, that it would ignore the event attributes properly. It would also be a bit harder to construct than using something like Beautiful Soup, and handling syntax errors may be a problem with SAX as well.
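
A minimal sketch of that event-driven idea using the standard library's HTMLParser (the module is named html.parser in Python 3): it re-emits the markup it is fed, dropping <script> elements and on* attributes, but it does no attribute-value escaping or URL checking, so it is only a starting point, not a complete defense:

from HTMLParser import HTMLParser  # html.parser in Python 3

class ScriptStripper(HTMLParser):
    # Re-emits markup, dropping <script> elements (and their contents)
    # and any on* event-handler attributes.

    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
        self.in_script = 0  # nesting depth of <script> elements

    def handle_starttag(self, tag, attrs):
        if tag.lower() == 'script':
            self.in_script += 1
        elif not self.in_script:
            kept = [(k, v or '') for k, v in attrs
                    if not k.lower().startswith('on')]
            # NOTE: a real sanitizer must escape these attribute values.
            self.out.append('<%s%s>' % (tag,
                ''.join(' %s="%s"' % kv for kv in kept)))

    def handle_endtag(self, tag):
        if tag.lower() == 'script':
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append('&#%s;' % name)

stripper = ScriptStripper()
stripper.feed(html)
cleaned = ''.join(stripper.out)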

What I like to do in situations like this is to construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
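
A rough sketch of that approach, with a hypothetical Element class standing in for the XML_Element subclasses described above (a real version would also have to escape text and attribute values when serializing):

class Element(object):
    # Hypothetical stand-in for the XML_Element subclasses described above.
    def __init__(self, name, attrs=None, children=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children or [])  # Element nodes or text strings

    def prune(self, unwanted=('script',)):
        # Drop event-handler attributes, then unwanted child elements, recursively.
        self.attrs = dict((k, v) for k, v in self.attrs.items()
                          if not k.lower().startswith('on'))
        self.children = [c for c in self.children
                         if not (isinstance(c, Element) and c.name in unwanted)]
        for c in self.children:
            if isinstance(c, Element):
                c.prune(unwanted)

    def serialize(self):
        attrs = ''.join(' %s="%s"' % kv for kv in self.attrs.items())
        inner = ''.join(c.serialize() if isinstance(c, Element) else c
                        for c in self.children)
        return '<%s%s>%s</%s>' % (self.name, attrs, inner, self.name)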

Regards,

RL Drenth