I think a program somewhere is emitting &deg and &reg without the trailing semicolon. Try using &deg; and &reg; in your HTML file, if it is indeed an HTML file. And please explain the context.
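To see what a missing semicolon does in practice, here is a minimal sketch. It assumes Python 3, whose standard-library html.unescape follows the HTML5 rules and accepts a small set of legacy names such as deg and reg without the semicolon. The thread itself uses Python 2, where this function is not available.

```python
import html

# Assumption: Python 3.4+ (html.unescape does not exist in Python 2).
# HTML5 allows a few legacy named references (deg, reg, amp, ...) to be
# written without the trailing semicolon, and html.unescape honours that.
with_semicolon = html.unescape('Preheat oven to 350&deg; F')
without_semicolon = html.unescape('Preheat oven to 350&deg F')

print(with_semicolon)
print(without_semicolon)
print(html.unescape('Welcome to Lorem Ipsum Inc&reg'))
```

A decoder that insists on the semicolon would leave the second and third strings untouched, which matches the symptom described here.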
A:
Marco Mariani
2010-05-19 12:20:17
Suhail
2010-05-19 12:47:02
+2
A:
Here's a script I wrote for tolerant unescaping of HTML character references from web pages. It assumes the references are in &deg; format, with a semicolon after them (Preheat oven to 350&deg; F, for example), but I thought you might have had trouble with Stack Overflow's formatting rendering them as actual HTML references when you wrote the question (correct me if I'm wrong):
from htmlentitydefs import name2codepoint

# Get the whitespace characters
DNums = {0: ' ', 1: '\t', 2: '\r', 3: '\n'}
DChars = dict((x, y) for y, x in DNums.items())
DNums2XML = {0: '&#32;', 1: '&#9;', 2: '&#13;', 3: '&#10;'}
DChars2XML = dict((DNums[i], DNums2XML[i]) for i in DNums2XML)

S = '1234567890ABCDEF'
DHex = {}
for i in S:
    DHex[i.lower()] = None
    DHex[i.upper()] = None
del S

def IsHex(S):
    if not S: return False
    for i in S:
        if i not in DHex:
            return False
    return True

class CUnescape:
    def __init__(self, S, ignoreWS=False):
        # Converts HTML character references into a unicode string to allow manipulation
        self.S = S
        self.ignoreWS = ignoreWS
        self.L = self.process(ignoreWS)

    def process(self, ignoreWS):
        def getChar(c):
            if ignoreWS:
                return c
            else:
                if c in DChars:
                    return DChars[c]
                else: return c

        LRtn = []
        L = self.S.split('&')
        xx = 0
        yy = 0
        for iS in L:
            if xx:
                LSplit = iS.split(';')
                if LSplit[0].lower() in name2codepoint:
                    # A character reference, e.g. '&amp;'
                    a = unichr(name2codepoint[LSplit[0].lower()])
                    LRtn.append(getChar(a)) # TOKEN CHECK?
                    LRtn.append(';'.join(LSplit[1:]))
                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:].isdigit():
                    # A character number e.g. '&#52;'
                    a = unichr(int(LSplit[0][1:]))
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))
                elif LSplit[0] and LSplit[0][0] == '#' and LSplit[0][1:2].lower() == 'x' and IsHex(LSplit[0][2:]):
                    # A hexadecimal encoded character
                    a = unichr(int(LSplit[0][2:].lower(), 16)) # Hex -> base 16
                    LRtn.append(getChar(a))
                    LRtn.append(';'.join(LSplit[1:]))
                else: LRtn.append('&%s' % ';'.join(LSplit))
            else: LRtn.append(iS)
            xx += 1
            yy += len(LRtn[-1])
        return LRtn

    def getValue(self):
        # Convert back into HTML, preserving
        # whitespace if self.ignoreWS is `False`
        L = []
        for i in self.L:
            if type(i) == int:
                L.append(DNums2XML[i])
            else:
                L.append(i)
        return ''.join(L)

def Unescape(S):
    # Get the string value from escaped HTML `S`, ignoring
    # explicit whitespace like tabs/spaces etc
    IUnescape = CUnescape(S, ignoreWS=True)
    return ''.join(IUnescape.L)

if __name__ == '__main__':
    print Unescape('Preheat oven to 350&deg; F')
    print Unescape('Welcome to Lorem Ipsum Inc&reg;')
EDIT: Following the original questioner's complaints in the comments, here's a simpler solution which only replaces named character references with their characters, and not numeric &#xx; references:
from htmlentitydefs import name2codepoint

def unescape(s):
    for name in name2codepoint:
        s = s.replace('&%s;' % name, unichr(name2codepoint[name]))
    return s

print unescape('Preheat oven to 350&deg; F')
print unescape('Welcome to Lorem Ipsum Inc&reg;')
That's it though, I'm doing this for free after all :-P
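If the numeric references matter too, a single regular-expression pass can cover named, decimal and hexadecimal forms. This is only a sketch under stated assumptions: it targets Python 3 (where htmlentitydefs is named html.entities and unichr is chr), it requires the trailing semicolon, and unescape_all is a hypothetical name, not something from the answer above.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches &name;, &#176; and &#xB0; (the semicolon is required here).
_REFERENCE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def unescape_all(s):
    """Replace named, decimal and hexadecimal character references in s."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == '#x':
                return chr(int(body[2:], 16))
            if body.startswith('#'):
                return chr(int(body[1:]))
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the text untouched.
            return match.group(0)
    return _REFERENCE.sub(replace, s)


print(unescape_all('Preheat oven to 350&deg; F'))
print(unescape_all('Welcome to Lorem Ipsum Inc&#174;'))
```

Because the substitution happens in one pass, text produced by one replacement is never decoded a second time, so '&amp;deg;' becomes '&deg;' rather than a degree sign.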
David Morrissey
2010-05-19 12:39:00
recipeDiv= BeautifulSoup.findAll('div', attrs={'id': 'preparation'})
recipeDiv= str(recipeDiv)
recipeDiv= BeautifulSoup(recipeDiv)
RN= len(recipeDiv('p'))
y=0
while (y < RN):
    recipeDivDict= {}
    recipeDivText= str(strip_tags(recipeDiv('p')[y]))
    recipeDivText= recipeDivText.strip()
    recipeDivText= recipeDivText.strip('\n')
    print recipeDivText
    recipeDivDict['directions']= recipeDivText
    RecipeList.append(recipeDivDict)
    y = y + 1
Suhail
2010-05-19 12:51:38
When I print recipeDivText, I get output like: Preheat oven to 350&deg;F
Suhail
2010-05-19 12:52:33
David Morrissey
2010-05-19 12:55:53
The problem is BeautifulSoup returns the HTML contents without unescaping the HTML character references (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) - the above does that
David Morrissey
2010-05-19 13:02:11
I can't use that code in mine. I just want to unescape the HTML character references in the string and display their actual symbols, and I can't use the code you posted above for that. I assume there must be a simpler solution for this.
Suhail
2010-05-20 03:57:05
+3
A:
$ python -c'from BeautifulSoup import BeautifulSoup
> print BeautifulSoup("""<html>Preheat oven to 350&deg; F
> Welcome to Lorem Ipsum Inc&reg;""",
> convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0].string'
Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®
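For readers on Python 3 without BeautifulSoup 3 installed, roughly the same effect (drop the tags, decode the references) can be sketched with the standard library's html.parser. The TextExtractor class and html_to_text helper are hypothetical names used for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content; the parser decodes character references itself."""

    def __init__(self):
        # convert_charrefs=True makes handle_data receive already-decoded text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the text of markup with tags removed and references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts).strip()


print(html_to_text('<p>Preheat oven to 350&deg; F</p>'))
print(html_to_text('<div>Welcome to <b>Lorem Ipsum Inc&reg;</b></div>'))
```

This keeps only text nodes, so it stands in for the strip_tags step as well as the entity conversion.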
J.F. Sebastian
2010-05-20 04:55:58
+1 this solution is way simpler than mine yet still achieves a similar result - please use this one
David Morrissey
2010-05-20 05:02:10
Sorry, please don't get me wrong; all I want is to get the registered and degree symbols. I'll try the one mentioned above.
Suhail
2010-05-20 06:03:28
Thanks Sebastian that solved my issue, and thanks David for co-operating with me.
Suhail
2010-05-20 06:45:32