I have a string with symbols like this:
&#39;
That's an apostrophe apparently.
I tried saxutils.unescape() and urllib.unquote(), both without any luck.
How can I decode this? Thanks!
Check out this question. What you're looking for is "html entity decoding". Typically, you'll find a function named something like "htmldecode" that will do what you want. Both Django and Cheetah provide such functions as does BeautifulSoup.
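As a rough illustration of what "html entity decoding" means in practice: on Python 3.4 or later, the standard library's `html.unescape` does this without any third-party package. This is only a sketch that assumes a Python 3 interpreter, which the question does not specify, and the sample strings are made up for the example.

```python
import html

# Illustrative inputs only: one decimal, one hex plus named, one named entity.
samples = ["l&#39;eau", "L&#x27;aldil&agrave;", "foo &lt; bar"]

# html.unescape handles numeric (decimal and hex) and named references.
decoded = [html.unescape(s) for s in samples]

for original, result in zip(samples, decoded):
    print(original, "->", result)
```

On Python 2 this module does not exist, so one of the library functions mentioned above or a hand-written decoder is needed instead.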
The other answer will work just great if you don't want to use a library and all the entities are numeric.
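For reference, a numeric-only decoder can be quite small. The sketch below is not the other answer's code; `decode_numeric_entities` is a hypothetical name, and it is written for Python 3 (on Python 2, `unichr` would replace `chr`). Named entities such as `&lt;` are deliberately left alone.

```python
import re

# Matches decimal (&#39;) and hexadecimal (&#x27;) character references only.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace numeric character references; leave named entities untouched."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(decode_numeric_entities("l&#39;eau and l&#x27;eau"))
```

The limitation is the one noted above: as soon as the input contains named entities, a lookup table is required as well.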
The most robust solution seems to be this function by Python luminary Fredrik Lundh. It is not the shortest solution, but it handles named entities as well as hex and decimal codes.
Try this: (found it here)
# -*- coding: utf-8 -*-
# Python 2 code: htmlentitydefs and unichr do not exist in Python 3.
from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities (hex, decimal, or named) in a string.
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&agrave; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")
    foo < bar
    """
    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            if match.group(2) == '':
                # number is in decimal
                return unichr(int(ent))
            elif match.group(2) == 'x':
                # number is in hex
                return unichr(int(ent, 16))
        else:
            # they were using a name; group(2) may have captured a
            # leading 'x' (as in &xi;), so put it back before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp:
                return unichr(cp)
        # unknown entity: leave the text unchanged
        return match.group()

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.sub(substitute_entity, string)