I am writing a Python web application and want to create a text area where users can enter text in a lightweight markup language. The text will be inserted into an HTML template and displayed on the page. Currently I create the textarea with the following code, which lets users enter any text, including HTML:

my_text = cgidata.getvalue('my_text', 'default_text')
ftable.AddRow([Label(_('Enter your text')),
               TextArea('my_text', my_text, rows=8, cols=60).Format()])

How can I change this so that only some safe (possibly lightweight) markup is allowed? All suggestions, including sanitizers, are welcome, as long as they integrate easily with Python.

+2  A: 

You could use reStructuredText. I'm not sure whether it has a sanitizing option, but it's well supported in Python and can generate many output formats.

Christopher
+1: RST and Docutils rule.
S.Lott
+7  A: 

Use the Python Markdown implementation:

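import markdown

text = "Some *Markdown* with <b>raw HTML</b>"  # example input
mode = "remove"  # or "replace" or "escape"
# safe_mode belongs to older Python-Markdown releases (deprecated in the 2.x series, removed in 3.0)
md = markdown.Markdown(safe_mode=mode)
html = md.convert(text)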

It is very flexible: you can use various extensions, create your own, and so on.

molicule
I tried it in IPython, defining text as some HTML that included a script tag. I got strange output: text was unchanged and html = '[HTML_REMOVED]'. What else do I need to do to get this to remove the dangerous tags? I tried all three modes with the same result.
Anna Granudd
After running a few tests, I realized that I can't enter any HTML tags, only Markdown syntax, and when I do that I get safe output. Thanks, it worked!
Anna Granudd
From the docs: To replace HTML, set safe_mode="replace" (safe_mode=True still works for backward compatibility with older versions). The HTML will be replaced with the text defined in markdown.HTML_REMOVED_TEXT, which defaults to [HTML_REMOVED]. To replace the HTML with something else: markdown.HTML_REMOVED_TEXT = "--RAW HTML IS NOT ALLOWED--"
molicule
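To make the "escape" option from the discussion above more concrete, here is a minimal sketch of what escaping raw HTML means, using only the standard library's html.escape. It illustrates the idea only and is not how Python-Markdown implements its safe mode; the function name escape_raw_html and the sample input are assumptions made for this example.

```python
import html


def escape_raw_html(text):
    """Turn HTML-special characters into entities so markup is displayed, not interpreted."""
    # quote=True also escapes double and single quotes, which matters inside attributes.
    return html.escape(text, quote=True)


# Hypothetical user input mixing a script tag with Markdown emphasis.
user_input = '<script>alert("x")</script> *hello*'
print(escape_raw_html(user_input))
```

Escaped this way, the script tag reaches the browser as visible text instead of executable markup, while the Markdown emphasis markers are left untouched.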
+1  A: 

This simple sanitizing function uses a whitelist and is roughly the same as the solution in python-html-sanitizer-scrubber-filter, but it also lets you limit which attributes are allowed (since you probably don't want anyone to use the style attribute, among others):

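# Written for BeautifulSoup 3 (Python 2), where tag.attrs is a list of (name, value) tuples.
from BeautifulSoup import BeautifulSoup

def sanitize_html(value):
    valid_tags = 'p i b strong a pre br'.split()
    valid_attrs = 'href src'.split()
    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name not in valid_tags:
            # Hide the tag itself but still render its contents.
            tag.hidden = True
        # Keep only whitelisted attributes on every tag.
        tag.attrs = [(attr, val) for attr, val in tag.attrs if attr in valid_attrs]
    # Note: this plain string replace only catches the literal text 'javascript:'.
    return soup.renderContents().decode('utf8').replace('javascript:', '')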
Gerald Senarclens de Grancy
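For Python 3, the same whitelist idea can be sketched with only the standard library. This is a port of the approach above under stated assumptions, not the answer author's code and not an audited sanitizer: the class name WhitelistSanitizer, the helper _is_safe_url, and the SAFE_SCHEMES set are hypothetical names introduced here, and the URL scheme check stands in for the original replace('javascript:', '') step. Like the original, it drops disallowed tags but keeps their text content, and it does not try to balance unclosed tags.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

VALID_TAGS = {'p', 'i', 'b', 'strong', 'a', 'pre', 'br'}
VALID_ATTRS = {'href', 'src'}
SAFE_SCHEMES = {'', 'http', 'https', 'mailto'}  # assumption: '' means a relative URL


def _is_safe_url(value):
    """Return True if the URL's scheme is in SAFE_SCHEMES."""
    # Drop whitespace and control characters, which browsers ignore inside a scheme.
    cleaned = ''.join(ch for ch in value if ch > ' ')
    try:
        scheme = urlparse(cleaned).scheme.lower()
    except ValueError:
        return False
    return scheme in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in VALID_TAGS:
            return  # drop the tag itself; its text still arrives via handle_data
        kept = []
        for name, value in attrs:
            if name in VALID_ATTRS and value is not None and _is_safe_url(value):
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append('<%s%s>' % (tag, ''.join(kept)))

    def handle_endtag(self, tag):
        if tag in VALID_TAGS and tag != 'br':  # br has no closing tag
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize_html(value):
    parser = WhitelistSanitizer()
    parser.feed(value)
    parser.close()  # flush any buffered text
    return ''.join(parser.out)


dirty = ('<p style="color:red">Hi <a href="javascript:alert(1)" '
         'onclick="x()">link</a><script>alert(1)</script></p>')
print(sanitize_html(dirty))
```

Checking the parsed URL scheme, rather than deleting the literal string javascript:, is a deliberate choice: the parser has already decoded character references in attribute values, so entity-encoded variants are compared in their decoded form.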