A: 

I think a program somewhere is emitting &deg and &reg without the trailing semicolon. Try writing them as &deg; and &reg; in your HTML file, if it is indeed an HTML file. Please also explain the context.
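To check whether a missing semicolon is really the cause, here is a small sketch. It assumes Python 3 (the rest of this thread targets Python 2) and uses the standard library's `html.unescape`, which follows the HTML5 rules: a few legacy names such as `&deg` and `&reg` are decoded even without the trailing semicolon. The sample strings are made up for illustration.

```python
import html

# Hypothetical sample strings; the second one lacks the trailing semicolon.
with_semicolon = "Preheat oven to 350&deg; F"
without_semicolon = "Welcome to Lorem Ipsum Inc&reg"

# html.unescape applies the HTML5 rules, so both forms are decoded.
decoded_with = html.unescape(with_semicolon)
decoded_without = html.unescape(without_semicolon)

print(decoded_with)
print(decoded_without)
```

If the second string comes back unchanged with your own unescaping code, the missing semicolon is the likely culprit.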

Marco Mariani
Suhail
+2  A: 

Here's a script I wrote for tolerant unescaping of HTML references from web pages. It assumes the references are in e.g. &deg; format, with a semicolon after them (Preheat oven to 350&deg; F, for example), but I thought you might have had trouble with Stack Overflow rendering them as actual HTML references when you wrote the question (correct me if I'm wrong):

from htmlentitydefs import name2codepoint

# Get the whitespace characters
DNums = {0: ' ', 1: '\t', 2: '\r', 3: '\n'}
DChars = dict((x, y) for y, x in DNums.items())
DNums2XML = {0: '&#32;', 1: '&#09;', 2: '&#13;', 3: '&#10;'}
DChars2XML = dict((DNums[i], DNums2XML[i]) for i in DNums2XML)

S = '1234567890ABCDEF'
DHex = {}
for i in S:
    DHex[i.lower()] = None
    DHex[i.upper()] = None
del S

def IsHex(S):
    if not S: return False
    for i in S: 
        if i not in DHex:
            return False
    return True

class CUnescape:
    def __init__(self, S, ignoreWS=False):
        # Converts HTML character references into a unicode string to allow manipulation
        self.S = S
        self.ignoreWS = ignoreWS
        self.L = self.process(ignoreWS)

    def process(self, ignoreWS):
        def getChar(c):
            if ignoreWS:
                return c
            else:
                if c in DChars:
                    return DChars[c]
                else: return c

        LRtn = []
        L = self.S.split('&')
        xx = 0
        yy = 0
        for iS in L:
            if xx:
                LSplit = iS.split(';')
                if LSplit[0].lower() in name2codepoint:
                    # A character reference, e.g. '&amp;'
                    a = unichr(name2codepoint[LSplit[0].lower()])
                    LRtn.append(getChar(a)) # TOKEN CHECK?
                    LRtn.append(';'.join(LSplit[1:]))

                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:].isdigit():
                    # A character number, e.g. '&#52;'
                    a = unichr(int(LSplit[0][1:]))
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))

                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:2].lower() == 'x' and IsHex(LSplit[0][2:]):
                    # A hexadecimal encoded character
                    a = unichr(int(LSplit[0][2:].lower(), 16)) # Hex -> base 16
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))

                else: LRtn.append('&%s' % ';'.join(LSplit))
            else: LRtn.append(iS)
            xx += 1
            yy += len(LRtn[-1])
        return LRtn

    def getValue(self):
        # Convert back into HTML, preserving 
        # whitespace if self.ignoreWS is `False`
        L = []
        for i in self.L:
            if type(i) == int:
                L.append(DNums2XML[i])
            else:
                L.append(i)
        return ''.join(L)

def Unescape(S):
    # Get the string value from escaped HTML `S`, ignoring 
    # explicit whitespace like tabs/spaces etc
    IUnescape = CUnescape(S, ignoreWS=True)
    return ''.join(IUnescape.L)

if __name__ == '__main__':
    print Unescape('Preheat oven to 350&deg; F')
    print Unescape('Welcome to Lorem Ipsum Inc&reg;')

EDIT: Following the original questioner's complaints in the comments, here's a simpler solution which only replaces named character references with characters, and not numeric &#xx; references:

from htmlentitydefs import name2codepoint

def unescape(s):
    for name in name2codepoint:
        s = s.replace('&%s;' % name, unichr(name2codepoint[name]))
    return s

print unescape('Preheat oven to 350&deg; F')
print unescape('Welcome to Lorem Ipsum Inc&reg;')

That's it though, I'm doing this for free after all :-P
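For anyone who also needs the numeric &#xx; references, here is a rough single-pass sketch. It assumes Python 3, where `htmlentitydefs` became `html.entities` and `unichr` became `chr`; the name `unescape_refs` and the regular expression are my own illustration, not part of any library.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# Matches &name;, &#123; and &#x7B; (the trailing semicolon is required here).
_REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_refs(s):
    """Replace named and numeric character references in one pass."""
    def repl(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named reference
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave the text alone.
            return match.group(0)
    return _REF.sub(repl, s)

print(unescape_refs("Preheat oven to 350&deg; F"))
print(unescape_refs("Welcome to Lorem Ipsum Inc&#174;"))
```

Doing the replacement in one pass means already-unescaped output is never scanned again, so text such as `&amp;deg;` comes out as the literal `&deg;` rather than being decoded twice.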

David Morrissey
No, no, it's not like that. Let me show you the code:
Suhail
recipeDiv = BeautifulSoup.findAll('div', attrs={'id': 'preparation'})
recipeDiv = str(recipeDiv)
recipeDiv = BeautifulSoup(recipeDiv)
RN = len(recipeDiv('p'))
y = 0
while (y < RN):
    recipeDivDict = {}
    recipeDivText = str(strip_tags(recipeDiv('p')[y]))
    recipeDivText = recipeDivText.strip()
    recipeDivText = recipeDivText.strip('\n')
    print recipeDivText
    recipeDivDict['directions'] = recipeDivText
    RecipeList.append(recipeDivDict)
    y = y + 1
Suhail
When I print recipeDivText, I get output like: Preheat oven to 350&deg;F
Suhail
David Morrissey
The problem is that BeautifulSoup returns the HTML contents without unescaping the HTML character references (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). The code above does that.
David Morrissey
I can't use that code in mine. I just want to unescape the HTML character references in the string and display the actual symbols, and I can't use the code you posted above for that. I assume there must be a simpler solution.
Suhail
+3  A: 
$ python -c'from BeautifulSoup import BeautifulSoup
> print BeautifulSoup("""<html>Preheat oven to 350&deg; F
> Welcome to Lorem Ipsum Inc&reg;""",
> convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0].string'
Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®
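If BeautifulSoup is not available, roughly the same effect (tags dropped, references decoded) can be sketched with the standard library alone. This assumes Python 3, unlike the session above; `TextExtractor` and `extract_text` are hypothetical names, and the sample markup is invented to resemble the asker's `preparation` div.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and decoding references."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &deg;, &#174;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called with the decoded text found between tags.
        self.parts.append(data)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()

sample = '<div id="preparation"><p>Preheat oven to 350&deg; F</p></div>'
print(extract_text(sample))
```

This is only a sketch: it concatenates every text node, so a real recipe page would still need the per-paragraph handling from the asker's loop.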
J.F. Sebastian
+1, this solution is way simpler than mine yet still achieves a similar result. Please use this one.
David Morrissey
Sorry, please don't get me wrong. All I want is to get the registered-trademark and degree symbols. I'll try the one mentioned above.
Suhail
Thanks, Sebastian, that solved my issue, and thanks, David, for cooperating with me.
Suhail
@Suhail: no problems, so long as you solved it :-)
David Morrissey