ansaurus

Question

Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)

Answer 1

+1 A:

Using lxml:

htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

from lxml.html import fromstring

mySearchTree = fromstring(htmlstring)

for item in mySearchTree.cssselect('a'):
    print(item.text)

bigredbob 2010-03-10 06:59:10

Thanks, bigredbob. It took me a while to get lxml running on my machine to try your code, but it works! I have not tested it on App Engine yet and will let you know if it does not work. lxml also seems to handle bad markup.

Ecognium 2010-03-10 07:37:13

Answer 2

+4 A:

>>> import BeautifulSoup
>>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']

This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
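Where BeautifulSoup is not available, the same collect-then-join idea can be sketched with only the standard library. This is a rough illustration, not the BeautifulSoup API: `TextCollector` and `text_fragments` are hypothetical names, and the parser here does not repair bad markup the way BeautifulSoup does.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags into a list."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fragments = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag.
        self.fragments.append(data)


def text_fragments(html):
    """Return the list of text fragments found in html."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.fragments


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
fragments = text_fragments(html)

# Join the list of fragments into a single string, as described above.
text = ''.join(fragments)
print(text)
```

Note that this sketch also keeps the contents of script and style elements, since those are passed to `handle_data` as well.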

Alex Martelli 2010-03-10 06:59:51

Thanks, Alex. That should work -- last time I tried BeautifulSoup I got into parsing node by node and that became very slow. Now that I have changed the way I am handling my HTML code, I could use BeautifulSoup for cleanup. I totally forgot about the text=True option. Thanks!

Ecognium 2010-03-10 07:21:27

@Ecognium, you're welcome!

Alex Martelli 2010-03-10 14:51:36

Answer 3

+1 A:

#!/usr/bin/python

from xml.dom.minidom import parseString

def getText(el):
    # Recursively concatenate the text of every text node under el.
    ret = ''
    for child in el.childNodes:
        if child.nodeType == child.TEXT_NODE:
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text  </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
# Wrap the fragment in a single root element so it parses as one XML document.
dom = parseString('<root>' + html + '</root>')
print(getText(dom.documentElement))

Prints:

this is a link and some bold text followed by an image
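Because minidom is an XML parser, the approach above only works on well-formed markup; an unclosed tag makes `parseString` raise `ExpatError`. A minimal sketch of guarding against that follows. The name `strip_tags_strict` and the choice to return `None` on failure are assumptions made for illustration; a caller could equally fall back to a more lenient parser.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def get_text(el):
    """Recursively concatenate the text nodes under el."""
    ret = ''
    for child in el.childNodes:
        if child.nodeType == child.TEXT_NODE:
            ret += child.nodeValue
        else:
            ret += get_text(child)
    return ret


def strip_tags_strict(html):
    """Hypothetical helper: return the text of html, or None if it is not well-formed."""
    try:
        # Wrap the fragment so it has a single root element.
        dom = parseString('<root>' + html + '</root>')
    except ExpatError:
        # Malformed markup (e.g. an unclosed tag) cannot be parsed by minidom.
        return None
    return get_text(dom.documentElement)


print(strip_tags_strict('<b>foo <i>bar</i></b>'))
print(strip_tags_strict('<b>foo'))
```

The first call prints the extracted text; the second prints `None` because the `<b>` tag is never closed.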

Amarghosh 2010-03-10 07:00:52

Thanks Amarghosh. I think minidom is supported on app-engine so that should work well.

Ecognium 2010-03-10 07:18:57

Amarghosh, I have accepted Alex's answer as BeautifulSoup seems to handle bad markup better. Thanks very much for the snippet, however; I can certainly use it for markup that I can trust.

Ecognium 2010-03-10 07:38:13

Answer 4

+1 A:

If you don't want to use separate libraries, you can import the standard Django utilities. For example:

from django.utils.html import strip_tags
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg'
stripped = strip_tags(html)
print(stripped)
# you got: foo

It is also already included in Django templates, so you don't need anything else; just use the filter, like this:

{{ unsafehtml|striptags }}

Btw, this is one of the fastest ways.

Mikhail Kashkin 2010-03-10 16:42:27

Thanks, Mikhail. I will give it a shot.

Ecognium 2010-03-11 07:51:04
