ansaurus

Question

How to unescape special characters from BeautifulSoup output?

Answer 1

A:

I think a program somewhere is emitting &deg and &reg without the trailing semicolon. Try writing them as &deg; and &reg; in your HTML file, if it is indeed an HTML file. Please also explain the context.
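
To illustrate the missing-semicolon case: on Python 3 (an assumption; the rest of this thread uses Python 2), the standard library's html.unescape follows the HTML5 rules and still resolves legacy references such as &deg and &reg when the trailing semicolon is absent. A minimal sketch, not a statement about what the original page contained:

```python
import html

# Legacy named references are resolved with or without the trailing semicolon.
with_semicolon = html.unescape('Preheat oven to 350&deg; F')
without_semicolon = html.unescape('Preheat oven to 350&deg F')

print(with_semicolon)
print(without_semicolon)
print(with_semicolon == without_semicolon)
```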

Marco Mariani 2010-05-19 12:20:17

Suhail 2010-05-19 12:47:02

Answer 2

+2 A:

Here's a script I wrote for tolerant unescaping of HTML character references from web pages. It assumes the references are written with a trailing semicolon, e.g. &deg; (as in "Preheat oven to 350&deg; F"). I thought you might have had trouble with Stack Overflow rendering them as actual HTML references when you wrote the question (correct me if I'm wrong):

from htmlentitydefs import name2codepoint

# Get the whitespace characters
DNums = {0: ' ', 1: '\t', 2: '\r', 3: '\n'}
DChars = dict((x, y) for y, x in DNums.items())
DNums2XML = {0: '&#32;', 1: '&#09;', 2: '&#13;', 3: '&#10;'}
DChars2XML = dict((DNums[i], DNums2XML[i]) for i in DNums2XML)

S = '1234567890ABCDEF'
DHex = {}
for i in S:
    DHex[i.lower()] = None
    DHex[i.upper()] = None
del S

def IsHex(S):
    if not S: return False
    for i in S: 
        if i not in DHex:
            return False
    return True

class CUnescape:
    def __init__(self, S, ignoreWS=False):
        # Converts HTML character references into a unicode string to allow manipulation
        self.S = S
        self.ignoreWS = ignoreWS
        self.L = self.process(ignoreWS)

    def process(self, ignoreWS):
        def getChar(c):
            if ignoreWS:
                return c
            else:
                if c in DChars:
                    return DChars[c]
                else: return c

        LRtn = []
        L = self.S.split('&')
        xx = 0
        yy = 0
        for iS in L:
            if xx:
                LSplit = iS.split(';')
                if LSplit[0].lower() in name2codepoint:
                    # A character reference, e.g. '&amp;'
                    a = unichr(name2codepoint[LSplit[0].lower()])
                    LRtn.append(getChar(a)) # TOKEN CHECK?
                    LRtn.append(';'.join(LSplit[1:]))

                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:].isdigit():
                    # A character number e.g. '&#52;'
                    a = unichr(int(LSplit[0][1:]))
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))

                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:2].lower() == 'x' and IsHex(LSplit[0][2:]):
                    # A hexadecimal encoded character
                    a = unichr(int(LSplit[0][2:].lower(), 16)) # Hex -> base 16
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))

                else: LRtn.append('&%s' % ';'.join(LSplit))
            else: LRtn.append(iS)
            xx += 1
            yy += len(LRtn[-1])
        return LRtn

    def getValue(self):
        # Convert back into HTML, preserving 
        # whitespace if self.ignoreWS is `False`
        L = []
        for i in self.L:
            if type(i) == int:
                L.append(DNums2XML[i])
            else:
                L.append(i)
        return ''.join(L)

def Unescape(S):
    # Get the string value from escaped HTML `S`, ignoring 
    # explicit whitespace like tabs/spaces etc
    IUnescape = CUnescape(S, ignoreWS=True)
    return ''.join(IUnescape.L)

if __name__ == '__main__':
    print Unescape('Preheat oven to 350&deg; F')
    print Unescape('Welcome to Lorem Ipsum Inc&reg;')
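
For readers on Python 3 (an assumption; the script above targets Python 2), the three cases it handles, named, decimal, and hexadecimal references, are also covered by the standard library's html.unescape. The sketch below is only an illustration of that function, not a port of the class above; the sample strings and the &bogus; reference are made up for the example.

```python
import html

# One sample per branch of the script above, plus an unrecognized reference.
samples = {
    'named': 'Preheat oven to 350&deg; F',
    'decimal': 'Preheat oven to 350&#176; F',
    'hexadecimal': 'Preheat oven to 350&#xB0; F',
    'unknown': 'Fish &bogus; chips',  # hypothetical, not a real entity name
}

# Unknown references are left exactly as written.
results = {kind: html.unescape(text) for kind, text in samples.items()}

for kind, text in results.items():
    print(kind, '->', text)
```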

EDIT: Following the original questioner's comments, here is a simpler solution that replaces only named character references (such as &deg;) with their characters and leaves numeric &#xx; references alone:

from htmlentitydefs import name2codepoint

def unescape(s):
    for name in name2codepoint:
        s = s.replace('&%s;' % name, unichr(name2codepoint[name]))
    return s

print unescape('Preheat oven to 350&deg; F')
print unescape('Welcome to Lorem Ipsum Inc&reg;')
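
A Python 3 sketch of the same idea, under these assumptions: Python 3, where htmlentitydefs became html.entities and unichr became chr; the names NAMED_REFERENCE and unescape_named are illustrative only. It swaps the repeated replace calls for a single regular-expression pass, so text produced by one replacement is never scanned again. With the loop above, an input such as &amp;deg; can end up unescaped twice, depending on the order in which the names are visited.

```python
import re
from html.entities import name2codepoint

# Matches named references such as '&deg;' (semicolon required, as in the loop above).
NAMED_REFERENCE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

def unescape_named(s):
    """Replace known named references in one pass; leave everything else untouched."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: keep the original text
    return NAMED_REFERENCE.sub(replace, s)

print(unescape_named('Preheat oven to 350&deg; F'))
print(unescape_named('Welcome to Lorem Ipsum Inc&reg;'))
print(unescape_named('&amp;deg; is unescaped only one level'))
```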

That's it though, I'm doing this for free after all :-P

David Morrissey 2010-05-19 12:39:00

No, it's not like that. Let me show you the code:

Suhail 2010-05-19 12:50:43

recipeDiv= BeautifulSoup.findAll('div', attrs={'id': 'preparation'})
recipeDiv= str(recipeDiv)
recipeDiv= BeautifulSoup(recipeDiv)
RN= len(recipeDiv('p'))
y=0
while (y < RN):
    recipeDivDict= {}
    recipeDivText= str(strip_tags(recipeDiv('p')[y]))
    recipeDivText= recipeDivText.strip()
    recipeDivText= recipeDivText.strip('\n')
    print recipeDivText
    recipeDivDict['directions']= recipeDivText
    RecipeList.append(recipeDivDict)
    y = y + 1

Suhail 2010-05-19 12:51:38

When I print recipeDivText, I get output like: Preheat oven to 350&deg;F

Suhail 2010-05-19 12:52:33

David Morrissey 2010-05-19 12:55:53

The problem is BeautifulSoup returns the HTML contents without unescaping the HTML character references (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) - the above does that

David Morrissey 2010-05-19 13:02:11

I can't use that code in mine. I just want to unescape the HTML character references in the string and display their actual symbols, and I assume there must be a simpler solution for this.

Suhail 2010-05-20 03:57:05

Answer 3

+3 A:

$ python -c'from BeautifulSoup import BeautifulSoup
> print BeautifulSoup("""<html>Preheat oven to 350&deg; F
> Welcome to Lorem Ipsum Inc&reg;""",
> convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0].string'
Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®
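
If BeautifulSoup is not available, a similar result can be sketched with only the Python 3 standard library (an assumption; the command above uses BeautifulSoup 3 on Python 2). html.parser.HTMLParser converts character references in text before passing it to handle_data when convert_charrefs is true. The names TextCollector and extract_text are hypothetical helpers for this example, not part of any library.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text nodes; character references arrive already converted."""

    def __init__(self):
        # convert_charrefs=True is the default on current Python 3;
        # it is passed explicitly here for clarity.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.chunks)

text = extract_text("""<html>Preheat oven to 350&deg; F
Welcome to Lorem Ipsum Inc&reg;""")
print(text)
```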

J.F. Sebastian 2010-05-20 04:55:58

+1 this solution is way simpler than mine yet still achieves a similar result - please use this one

David Morrissey 2010-05-20 05:02:10

Sorry, please don't get me wrong; all I want is to get the registered and degree symbols. I'll try the one mentioned above.

Suhail 2010-05-20 06:03:28

Thanks Sebastian that solved my issue, and thanks David for co-operating with me.

Suhail 2010-05-20 06:45:32

@Suhail: no problems, so long as you solved it :-)

David Morrissey 2010-05-20 06:47:16
