ansaurus

Question

Answer 1

+1 A:

There is nothing built into the Python stdlib to unescape HTML, but there's a short script you can tailor to your needs at http://www.w3.org/QA/2008/04/unescape-html-entities-python.html.

Benjamin Pollack 2009-03-19 17:03:50

Answer 2

+9 A:

The standard library's HTMLParser module has this functionality. Unfortunately, the unescape method is undocumented:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

htmlentitydefs is documented, but requires you to do a lot of the work yourself.

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
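As a rough sketch of the plain string replace approach, assuming only the five predefined entities matter and no numeric character references appear in the input (the function name unescape_predefined is made up for illustration):

```python
# Hypothetical helper: handles only the five XML predefined entities.
# '&amp;' is replaced last so that '&amp;lt;' becomes '&lt;' rather than '<'.
_PREDEFINED = (
    ('&lt;', '<'),
    ('&gt;', '>'),
    ('&quot;', '"'),
    ('&apos;', "'"),
    ('&amp;', '&'),
)


def unescape_predefined(text):
    """Replace the five XML predefined entities with their characters."""
    for entity, char in _PREDEFINED:
        text = text.replace(entity, char)
    return text
```

The ordering is the one design choice here: replacing '&amp;' first would unescape already-escaped text twice.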

bobince 2009-03-19 17:20:56

+1 I didn't know about that function of HTMLParser.

vartec 2009-03-19 17:48:26

Answer 3

+1 A:

Use the htmlentitydefs module. This is my old code; it worked, but I'm sure there is a cleaner and more Pythonic way to do it:

import htmlentitydefs
e2c = dict(('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items())
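One possible way to put name2codepoint to work is a single regular-expression pass that also covers numeric character references. This is only a sketch: the name unescape_entities and the pattern are my own assumptions, and the import fallback is there so the same code runs on Python 2 (htmlentitydefs) and Python 3 (html.entities).

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2
except ImportError:
    from html.entities import name2codepoint  # Python 3

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Python 3: chr already returns a Unicode character

# Matches decimal (&#946;), hexadecimal (&#x3b2;) and named (&beta;) references.
_ENTITY_RE = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')


def unescape_entities(text):
    """Hypothetical helper: replace HTML entities in text with characters."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ('#x', '#X'):
                return unichr(int(body[2:], 16))
            if body.startswith('#'):
                return unichr(int(body[1:]))
        except (ValueError, OverflowError):
            # Code point out of range: leave the reference untouched.
            return match.group(0)
        if body in name2codepoint:
            return unichr(name2codepoint[body])
        # Unknown entity name: leave it as written.
        return match.group(0)

    return _ENTITY_RE.sub(replace, text)
```

Unknown names and out-of-range code points are left as they appear in the input, which seemed the least surprising choice for a sketch.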

vartec 2009-03-19 17:22:30

Answer 4

+2 A:

I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

This does exactly what I was hoping for.

tghw 2009-03-19 17:45:15


HTML Entity Codes to Text
