I have a string with symbols like this:

&#39;

That's an apostrophe apparently.

I tried saxutils.unescape() and urllib.unquote(), both without any luck.

How can I decode this? Thanks!

A: 

I am not sure about the & or the #, but here is some code for decoding:

>>> chr(39)
"'"
>>> ord("'")
39
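
To turn that into an actual decoder, a regular expression can pick out each numeric reference and hand the number to chr(). This is only a minimal sketch, assuming Python 3 (on Python 2, unichr() would take the place of chr()). The name decode_numeric_entities is made up for illustration, and it deliberately leaves named entities such as &amp; alone:

```python
import re

# Matches decimal (&#39;) and hexadecimal (&#x27;) character references.
_NUMERIC_ENTITY = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def decode_numeric_entities(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Code point out of range: leave the reference untouched.
            return match.group(0)
    return _NUMERIC_ENTITY.sub(replace, text)

print(decode_numeric_entities("l&#39;eau"))
```

Anything that isn't a valid numeric reference passes through unchanged, so it should be safe to run over text that mixes in named entities.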
+2  A: 

Check out this question. What you're looking for is "html entity decoding". Typically, you'll find a function named something like "htmldecode" that will do what you want. Both Django and Cheetah provide such functions as does BeautifulSoup.

The other answer will work just great if you don't want to use a library and all the entities are numeric.
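
If Python 3.4 or later is an option, the standard library has this built in as html.unescape(), which handles named, decimal, and hex entities alike. A minimal sketch, assuming Python 3 (the function is not available under that name on Python 2); the sample strings are just for illustration:

```python
import html

# One decimal, one hex plus named, and one named entity.
samples = ["l&#39;eau", "L&#x27;aldil&agrave;", "foo &lt; bar"]
decoded = [html.unescape(s) for s in samples]

for s in decoded:
    print(s)
```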

easel
thanks. what does Django have? because i looked in the docs but couldn't find anything...
rick
It's called django.utils.html.escape, apparently. Check out the other stackoverflow question I linked for some more details.
easel
it looks like django.utils.html.escape only works to encode, not decode. i ended up using BeautifulSoup. thanks
rick
+1  A: 

The most robust solution seems to be this function by Python luminary Fredrik Lundh. It is not the shortest solution, but it handles named entities as well as hex and decimal codes.

John Y
A: 

Try this: (found it here)

from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities–hex, decimal, or named–in a string
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&agrave; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")
    foo < bar
    """
    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            if match.group(2) == '':
                # number is in decimal
                return unichr(int(ent))
            elif match.group(2) == 'x':
                # number is in hex
                return unichr(int('0x'+ent, 16))
        else:
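            # they were using a name; group(2) may have swallowed a leading
            # "x" (as in &xi;), so put it back before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp: return unichr(cp)
            else: return match.group()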

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.subn(substitute_entity, string)[0]
Adrian Mester