tags:

views:

3661

answers:

7

What's the best way to sanitise user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to ensure that an XSS or SQL injection attack isn't possible?

+1  A: 

If you are using a framework like django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure django automatically does it unless you tell it not to.

Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.

Justin Standard
+4  A: 

Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: http://blog.stackoverflow.com/2008/06/safe-html-and-xss/

However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.

SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitation there either.

Eli Courtwright
+15  A: 

Here is a snippet that will remove all tags not on the white list, and all tag attributes not on the attribues whitelist (so you can't use onclick).

It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja&#x09;vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)

As you can see, it uses the (awesome) BeautifulSoup library.

import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val) # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))

    return soup.renderContents().decode('utf8')

As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.

tghw
I upvoted this, but now I'm not so sure. I don't think this protects IE users from src="vbscript:msgbox('xss')" attacks.
Gareth Simpson
You could easily add that with another regex for vbscript: like the one for javascript:
tghw
Added VBScript protection.
tghw
@tghw, The vbscript example here is why whitelist solutions are generally preferable to blacklist solutions. How can you know for sure that everything you need is blacklisted? With a blacklist, a new browser could come out next week and be vulnerable because it supports a new type of script tag.
gnibbler
@gnibbler I agree, and most of that is a whitelist solution, but for href and src, there's really no way to whitelist easily.The only option I can think of would be to make all URLs absolute by passing in the page URL and then going through each link and image and figure out the absolute URL based on the page URL.The more I think of it, the easier this seems. I'll add it above.
tghw
It's generally very hard to sanitize HTML, there are plenty of vectors: http://nick.cleaton.net/xssrant.html
rjh
Thanks much. I've included this technique in a blog post covering the topic of setting up secure user input with Django: http://birdhouse.org/blog/2010/05/12/secure-user-input-with-django/
shacker
This is perfect. Using it to sanitize comments on my lil' ol' blog. :) -- Thanks.
PKKid
+3  A: 

I don't do web development much any longer, but when I did, I did something like so:

When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).

Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)

In short always escape what can affect the current target for the data.

When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.

See also Escaping HTML

Henrik Gustafsson
+7  A: 

The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML Entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any html input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z,A-Z,0-9 for example) is going to let something through.

SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.

This is not to say that filtering isn't still a best practice, but in terms of SQL Injection and XSS, you will be far more protected if you religiously use Parameterize Queries and HTML Entity Encoding.

+11  A: 

html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.

Here's now I'm using it in my Stack Overflow clone's sanitize_html utility function:

http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py

I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format at it after performing Markdown to HTML conversion using python-markdown2 and it seems to have held up ok.

The WMD editor component which Stackoverflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.

insin
+1  A: 

To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.

For example (if it is acceptable to remove quotes completely):

datasetName = datasetName.replace("'","").replace('"',"")
Mr. Napik