ansaurus

Question

Python HTML sanitizer / scrubber / filter

Answer 1

+19 A:

Here's a simple solution using BeautifulSoup:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    return soup.renderContents()

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden = True.

You might also look into using lxml and Tidy.
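For readers without BeautifulSoup installed, the same whitelist idea can be sketched with only the Python 3 standard library. This is a rough illustration under stated assumptions, not the code from the answer: it swaps in html.parser for BeautifulSoup, the names WhitelistFilter and VOID_TAGS are made up for the example, and it drops every attribute, which is stricter than the original. Like tag.hidden = True, it keeps the text inside disallowed tags.

```python
from html import escape
from html.parser import HTMLParser

VALID_TAGS = {'strong', 'em', 'p', 'ul', 'li', 'br'}
VOID_TAGS = {'br'}  # tags that never get a closing tag


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags; all other markup is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded on purpose (assumption of this sketch).
        if tag in VALID_TAGS:
            self.parts.append('<%s>' % tag)

    def handle_endtag(self, tag):
        if tag in VALID_TAGS and tag not in VOID_TAGS:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.parts.append(escape(data, quote=False))


def sanitize_html(value):
    parser = WhitelistFilter()
    parser.feed(value)
    parser.close()
    return ''.join(parser.parts)
```

Because every `<` in the output comes either from a whitelisted tag or from escaped text, a disallowed tag cannot reappear after filtering. That is a property of this sketch only; treat it as an illustration rather than a vetted sanitizer.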

bryan 2009-03-30 23:35:40

Thanks, I didn't need this ATM, but knew I would need to find something like this in the future.

jfar 2009-03-30 23:37:13

The import statement should probably be `from BeautifulSoup import BeautifulSoup`.

Nikhil Chelliah 2009-05-30 21:21:13

You may also want to limit the use of attributes. To do so, just add this to the solution above:

valid_attrs = 'href src'.split()
for ...:
    ...
    tag.attrs = [(attr, val) for attr, val in tag.attrs if attr in valid_attrs]

HTH

Gerald Senarclens de Grancy 2009-08-03 20:46:44

This is not safe! See the answer by Chris Dost: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785

Thomas 2010-09-10 11:33:44

Answer 2

A:

Have you tried with BeautifulSoup?

miya 2009-03-30 23:36:42

Answer 3

+9 A:

The above solutions via Beautiful Soup will not work. You might be able to hack something with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. In a while, I think I'll try to solve the problem properly, but it's a week-long project or so, and I don't have a free week soon.

To be specific: not only will Beautiful Soup throw exceptions on some parsing errors that the above code doesn't catch, but there are also plenty of very real XSS vulnerabilities that it misses, such as:

<<script>script> alert("Haha, I hacked your page."); </</script>script>
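To see why this kind of input is dangerous for any filter that removes tags in a single pass, here is a simplified stand-in. It uses a plain regex rather than Beautiful Soup, so it illustrates the class of problem and does not reproduce the exact behaviour of the code above; SCRIPT_TAG and strip_once are names invented for the example.

```python
import re

# Naive filter: delete every literal <script> or </script> tag, once.
SCRIPT_TAG = re.compile(r'</?script>', re.I)


def strip_once(text):
    return SCRIPT_TAG.sub('', text)


payload = '<<script>script> alert("Haha, I hacked your page."); </</script>script>'
after_one_pass = strip_once(payload)

# Removing the inner tags glues the leftover pieces into a working tag pair:
# after_one_pass == '<script> alert("Haha, I hacked your page."); </script>'
```

The removal itself is what assembles the attack, which is why output that has had something stripped from it cannot be trusted without another look.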

Probably the best thing that you can do is instead to escape every < character as &lt;, which prohibits all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can also go back and re-introduce common bits of HTML with a regex. Here's what the process looks like, roughly:

import re
from markdown import markdown

_lt_ = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.
_tce_ = re.escape(_tc_)   # escaped, so the parentheses are matched literally
_ok_ = re.compile(_tce_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tce_ + 'sqrt>', re.I)     # just to give an example of extending
_endsqrt_ = re.compile(_tce_ + '/sqrt>', re.I) # html syntax with your own elements.
_tcre_ = re.compile(_tce_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)      # 1. neutralise every '<'
    text = markdown(text)            # 2. render Markdown formatting
    text = _ok_.sub(r'<\1>', text)   # 3. re-introduce the whitelisted tags
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)  # 4. anything left stays escaped

I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the ok stuff.
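The "escape everything, then whitelist" order can be tried out with just the Python 3 standard library. The sketch below makes some assumptions: it leaves out the Markdown step entirely, uses html.escape instead of a placeholder token, and the names ALLOWED and _ESCAPED_TAG are invented for the example. It only re-enables bare tags with no attributes.

```python
import re
from html import escape

ALLOWED = ('u', 'b', 'i', 'em', 'strong', 'sup', 'sub',
           'p', 'br', 'q', 'blockquote', 'code')

# Matches an *escaped* tag such as "&lt;b&gt;" or "&lt;/b&gt;".
_ESCAPED_TAG = re.compile(r'&lt;(/?(?:%s))&gt;' % '|'.join(ALLOWED), re.I)


def sanitize(text):
    # Step 1: escape &, < and > so that no markup survives at all.
    escaped = escape(text, quote=False)
    # Step 2: turn back only the exact, attribute-free tags we allow.
    return _ESCAPED_TAG.sub(r'<\1>', escaped)
```

For example, sanitize('<b>bold</b> <script>alert(1)</script>') keeps the bold tags and leaves the script tags as visible, escaped text. Since nothing is ever deleted, there are no leftover fragments that could recombine into a tag.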

2009-05-01 19:05:45

If you're trying this, first do:

import re
from markdown import markdown

If you don't have markdown, you can try easy_install.

Luke Stanley 2010-01-04 18:14:19

Answer 4

+12 A:

Here is what I use in my own project. The acceptable_elements and acceptable_attributes lists come from feedparser, and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky input could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we cant exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e
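The "reparse until it stops changing" step in clean_html is a fixed-point loop. Stripped of the parsing details, the idea looks roughly like the following. This is a regex stand-in written for Python 3, not the BeautifulSoup code above, and SCRIPT_TAG and strip_until_stable are hypothetical names.

```python
import re

SCRIPT_TAG = re.compile(r'</?script>', re.I)


def strip_until_stable(text):
    """Keep stripping until a pass makes no further change."""
    while True:
        stripped = SCRIPT_TAG.sub('', text)
        if stripped == text:
            # Nothing was removed this round, so no new tag can have formed.
            return stripped
        text = stripped


payload = '<<script></script>script> alert(1); <<script></script>/script>'
result = strip_until_stable(payload)
# Pass 1 leaves '<script> alert(1); </script>', pass 2 removes that too.
```

Each pass can only shorten the text, so the loop always terminates.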

THC4k 2009-05-01 19:26:54

Thanks, this works like a charm.

David Underhill 2010-08-17 01:21:28

This is not safe! See the answer by Chris Dost: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785

Thomas 2010-09-10 11:32:30

@Thomas: Do you have anything to support that claim? Chris Dost's "unsafe" code actually just raises an Exception, so I guess you didn't actually try it.

THC4k 2010-09-10 15:07:05

@THC4k: Sorry, I forgot to mention that I had to modify the example. Here's one that works: `<<script></script>script> alert("Haha, I hacked your page."); <<script></script>script>`

Thomas 2010-09-10 15:19:24

Also, the `tag.extract()` modifies a list that we're iterating over. That confuses the loop, and causes it to skip the next child.

Thomas 2010-09-10 15:20:12
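The skipping effect described in the comment above is ordinary Python behaviour and can be reproduced with a plain list, no BeautifulSoup needed. A small sketch with made-up function names, assuming Python 3:

```python
def remove_scripts_buggy(tags):
    tags = list(tags)
    for tag in tags:            # iterating over the list being modified
        if tag == 'script':
            tags.remove(tag)    # shifts later items left; the next one is skipped
    return tags


def remove_scripts_fixed(tags):
    tags = list(tags)
    for tag in list(tags):      # iterate over a snapshot instead
        if tag == 'script':
            tags.remove(tag)
    return tags


names = ['script', 'script', 'script', 'p']
# remove_scripts_buggy(names) -> ['script', 'p']   (one script survives)
# remove_scripts_fixed(names) -> ['p']
```

Iterating over a copy, or collecting the items first and removing them afterwards, avoids the problem.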

@Thomas: Really nice catches! I think I fixed both issues, thanks a lot!

THC4k 2010-09-10 16:22:35

Answer 5

+8 A:

Use lxml.html.clean!

Suppose the following html:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

So easy!

from lxml.html.clean import clean_html
print clean_html(html)

The results...

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize the elements you want to clean and whatnot.
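One detail worth noticing in the output: the javascript: link had its href emptied, while the ordinary links kept theirs. A rough sketch of that kind of scheme check using only the Python 3 standard library follows. It is not how lxml implements it; ALLOWED_SCHEMES and has_allowed_scheme are hypothetical names, and the set of allowed schemes is an assumption you would adjust.

```python
from urllib.parse import urlsplit

# Empty scheme means a relative URL or a bare fragment such as "#".
ALLOWED_SCHEMES = {'', 'http', 'https', 'mailto'}


def has_allowed_scheme(url):
    """Return True only if the URL's scheme is on the whitelist."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # Unparseable URLs are rejected rather than guessed at.
        return False
    return scheme in ALLOWED_SCHEMES
```

With this, has_allowed_scheme('javascript:evil_function()') is False, while '#', 'evil-site' and 'http://example.com/' all pass. Whitelisting schemes is safer than blacklisting javascript:, since there are many ways to disguise a scheme name.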

nosklo 2010-04-23 23:43:24

See the docstring for `lxml.html.clean.clean()` method. It has plenty of options!

Denilson Sá 2010-07-17 13:39:05

By the way, you should fix the link at the top of your answer. It should point to http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html

Denilson Sá 2010-07-17 13:42:47
