I'm trying to work out if there is a better way to achieve the following:

from lxml import html
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<p>&pound;682m</p>")
text = soup.find("p").string

print text
>>> &pound;682m

print html.fromstring(text).text
>>> £682m

So I'm trying to produce the same string that lxml returns when I do the second print. I'd rather not have to resort to lxml in order to interpret these escaped characters: can anyone provide a way of doing this with something in the standard library?

[edit: I've accepted luc's answer but both are valid: I just thought that the answer that made use of the standard library was probably more useful in a generic sense]

+5  A: 

You can also use the HTML parser from the standard library; see http://docs.python.org/library/htmlparser.html

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape('&pound;682m')
£682m
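For readers on a newer interpreter, here is a minimal sketch of the same idea. It assumes Python 3.4 or later, where the standard library documents a module-level `html.unescape` function, and it falls back to the `HTMLParser` method shown above on Python 2. The names `unescape` and `text` are just illustrative.

```python
try:
    # Python 3.4+: documented function in the standard library
    from html import unescape
except ImportError:
    # Python 2 fallback: the undocumented method used above
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Decode the named entity from the question
text = unescape("&pound;682m")
print(text)
```

On Python 3 this prints £682m. The try/except is only there so that the same snippet can be run on both major versions.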
luc
Note that this method is not officially documented… (but has been quite stable so far).
EOL
This method doesn't seem to unescape the entity for characters like "’" on Google App Engine, though it works locally on Python 2.6. It does still decode some entities (such as the one for ") at least.
gfxmonk
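One way to check how a given runtime treats the kinds of entities mentioned in the comment above is to decode a few spellings of the same character and compare. This is only a sketch: it assumes Python 3.4+ and `html.unescape` rather than the Python 2 method discussed here, and the `samples` list is made up for illustration.

```python
from html import unescape

# Named, decimal and hexadecimal references for the right single
# quotation mark (U+2019), plus the named entity for a double quote
samples = ["&rsquo;", "&#8217;", "&#x2019;", "&quot;"]

decoded = [unescape(s) for s in samples]
for raw, out in zip(samples, decoded):
    print(raw, "->", repr(out))
```

If any of the first three does not come back as the same character, the runtime's entity handling differs from the local one.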
+8  A: 

BeautifulSoup handles entity conversion:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
Ben James
BeautifulStoneSoup is for XML parsing. Use BeautifulSoup for HTML.
interjay
+1. No idea how I missed this in the docs: thanks for the info. I'm going to accept luc's answer though, because his uses the standard lib, which I specified in the question (not important to me), and it's probably of more general use to other people.
jkp
interjay: fixed, the same applies to `BeautifulSoup` also.
Ben James
jkp: Actually, I think you are not helping people, who may see that accepted answer and continue to believe BeautifulSoup can't handle entities properly. If an assumption you made in your question (i.e. that BeautifulSoup couldn't do it) was incorrect, you can always edit and point that out.
Ben James
@Ben James: fair point, have edited my question to point that out. btw: I'm using your solution for what it's worth! Thanks again.
jkp