I have looked all around and only found solutions for python 2.6 and earlier, NOTHING on how to do this in python 3.X. (I only have access to Win7 box.)

I HAVE to be able to do this in 3.1, preferably without external libraries. Currently, I have httplib2 installed and access to command-prompt curl (that's how I'm getting the source code for pages). Unfortunately, as far as I know, curl does not decode HTML entities; I couldn't find an option for it in the documentation.

YES, I've tried to get Beautiful Soup to work, MANY TIMES without success in 3.X. If you could provide EXPLICIT instructions on how to get it to work in python 3 in MS Windows environment, I would be very grateful.

So, to be clear, I need to turn strings like this: "Suzy &amp; John" into a string like this: "Suzy & John".

+1  A: 

xml.sax.saxutils.unescape

For example,

In [1]: import xml.sax.saxutils

In [2]: xml.sax.saxutils.unescape('Suzy &amp; John')
Out[2]: 'Suzy & John'
unutbu
Awesome! However, I see that it only unescapes certain characters. For example, the ampersand remains escaped. Could you explain why this is? How do I unescape these characters?
Sho Minamimoto
@Sho Minamimoto: I added an example. Hope it helps?
unutbu
Yeah, I got it, thanks!
Sho Minamimoto
@Sho Minamimoto: Great! :-)
unutbu
+1  A: 

Python 3.x has html.entities too
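As a rough sketch of how html.entities could be used here: the module's name2codepoint dict maps entity names to code points, so a small regex substitution can decode named entities and, with a little extra work, numeric references such as &#39;. The function name decode_entities and the regex are my own illustration, not part of the standard library. (On Python 3.4 and later, html.unescape does this in a single call, but that is not available in 3.1.)

```python
import re
from html.entities import name2codepoint

# Matches &#39; (decimal), &#x27; (hex) and &amp; (named) style references.
_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|\w+);")

def decode_entities(text):
    """Hypothetical helper: decode named and numeric HTML entities."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
        except (ValueError, OverflowError):
            return match.group(0)               # out-of-range code point
        if body in name2codepoint:
            return chr(name2codepoint[body])    # named entity
        return match.group(0)                   # unknown name: leave as-is
    return _ENTITY_RE.sub(replace, text)

print(decode_entities("Suzy &amp; John"))  # Suzy & John
```

Unknown names are left untouched rather than dropped, so text that merely looks like an entity survives the round trip.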

S.Mark
+1  A: 

I am not sure if this is a built-in library or not, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}) Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
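The entities argument in that signature can carry extra replacements beyond the three defaults. Here is a minimal sketch, assuming you only need a couple of additional entities; the dict contents and the name unescape_more are just illustrative choices.

```python
from xml.sax.saxutils import unescape

# Extra replacements applied in addition to the default &amp;, &lt;, &gt;.
# Keys are the escaped forms, values are the characters they stand for.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}

def unescape_more(text):
    """Hypothetical wrapper: unescape the defaults plus EXTRA_ENTITIES."""
    return unescape(text, EXTRA_ENTITIES)

print(unescape_more("&quot;Suzy &amp; John&quot;"))  # "Suzy & John"
```

This only covers the entities you list by hand, so it suits input with a small, known set of escapes.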

Jacob

TheJacobTaylor
+1  A: 

You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
Greg Hewgill
A: 

Hi,

I'm having the same problem, but with other characters too, for example an escaped apostrophe ('). The saxutils solution does solve the problem with &amp;, but not with all the entities that your browser (or your crawler, as in my case) might find on its way.

Any idea?

marcorossi
A: 
Derrick Petzold