I'm trying to decode HTML entities in text from NYTimes.com, and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.

+1  A: 

Try this:

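import re

def _callback(matches):
    # Group 1 holds the decimal code point, e.g. "8217" from "&#8217;".
    code = matches.group(1)
    try:
        return unichr(int(code))
    except (ValueError, OverflowError):
        # Not a valid code point: leave the original text untouched.
        return matches.group(0)

def decode_unicode_references(data):
    # The reference may end with ";" or be followed by whitespace (lookahead only).
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

data = u"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
print decode_unicode_references(data)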
Evan Fosmark
This is the error I keep getting regardless of what I try: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 12: character maps to <undefined>
KeyboardInterrupt
Could you provide more code, then? I just tried it with the function I wrote and the character 2019 works fine. It shows up as: ߣ
Evan Fosmark
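The two numbers in this exchange are easy to mix up, so here is a minimal Python 3 sketch (the thread itself uses Python 2). The `u'\u2019'` in the error message is hexadecimal, while "character 2019" read as decimal is a different code point. The sketch also reproduces the failure by encoding to a legacy code page; `cp437` is only an assumed stand-in for whatever the asker's console uses, and `can_encode` is a hypothetical helper.

```python
# The reference &#8217; is decimal; the escape \u2019 is hexadecimal.
assert int("8217") == 0x2019

right_quote = chr(0x2019)  # RIGHT SINGLE QUOTATION MARK, the character in the headline
other_char = chr(2019)     # decimal 2019 is U+07E3, an unrelated character


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: the console uses a legacy code page such as cp437.
print(can_encode(right_quote, "cp437"))   # the quote has no slot in cp437
print(can_encode(right_quote, "utf-8"))   # UTF-8 can represent any code point

# A lossy fallback for display: unencodable characters become "?".
print(right_quote.encode("cp437", errors="replace"))
```

If this is the cause, the decoding step itself is working and only the final `print` to the console fails.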
A few questions on your regexp: (1) Shouldn't it be \d instead of \w? The regexp will match `&#xa0;` and `&nbsp;` but then it will crash in int() (2) Allowing the character reference (it's NOT an entity) to end in a whitespace instead of ';' seems very tolerant -- shouldn't you mention this? (3) Wouldn't the last part be better written as [;\s]?
John Machin
John, you were *partially* correct on point one. It won't match &nbsp; since that doesn't start with `&#`, but yes, it should have been `\d`. Regarding point two, allowing the reference to end with whitespace: it isn't pretty, but it's still supported. I've updated the code in the following ways: (1) changed it to `\d`, (2) made the callback a bit more robust, and (3) used a lookahead assertion for trailing whitespace instead of consuming it as before.
Evan Fosmark
Evan, thanks for the enlightenment, especially about the tolerance of whitespace, which I didn't know about. I got some more clues by looking in the HTML 4.01 and 2.0 specs. They referred to the SGML standard (ISO 8879). Cost = CHF 238(!) so I didn't read it, but HTML 2.0 commented that ';' is only needed when the character following the reference would otherwise be part of the name. Experiments with FF, IE and Opera using space - / X A and ` all gave the same result: they terminate the reference and are not swallowed. I'm looking forward to your updated solution ;-)
John Machin
+2  A: 

This does work:

from BeautifulSoup import BeautifulStoneSoup
s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

If you want a byte string instead of a Unicode object, you'll need to encode it with an encoding that supports the characters being used; ISO-8859-1 doesn't:

result = decoded.encode("UTF-8")
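To illustrate the encoding point, here is a small Python 3 sketch (an assumption, since the answer targets Python 2) that encodes a string containing U+2019 both ways. The variable names are illustrative only.

```python
text = "Adviser\u2019s"  # contains U+2019, the decoded form of &#8217;

# UTF-8 can represent every Unicode character.
utf8_bytes = text.encode("utf-8")

# ISO-8859-1 only covers U+0000..U+00FF, so U+2019 cannot be encoded.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# If ISO-8859-1 output is required, one option is to turn unencodable
# characters back into numeric character references.
fallback = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(utf8_bytes)
print(latin1_ok)
print(fallback)
```

The `xmlcharrefreplace` fallback round-trips to the original markup, which may or may not be what you want depending on where the bytes end up.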

It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

Glenn Maynard
lxml, alas also not in the standard library, also provides a Beautiful Soup parser (and lots more) with somewhat less "creative" names.
Ned Deily
Support for entity decoding is in the standard library (module htmlentitydefs). What the OP has are (decimal) numeric character references, not entities.
John Machin
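For readers on Python 3 (an assumption; the comment above names the Python 2 module), the same entity table is available as `html.entities`. A minimal sketch of a lookup, which covers named entities only and not numeric character references:

```python
from html.entities import name2codepoint

# Map an entity name (without "&" and ";") to its character.
nbsp = chr(name2codepoint["nbsp"])    # U+00A0 NO-BREAK SPACE
rsquo = chr(name2codepoint["rsquo"])  # U+2019, same character as &#8217;

print(hex(ord(nbsp)), hex(ord(rsquo)))
```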
+5  A: 

Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example &#160; &#xa0; &nbsp; all mean U+00A0 NO-BREAK SPACE.

&#160; (the type you have) is a "numeric character reference" (decimal).
&#xa0; is a "numeric character reference" (hexadecimal).
&nbsp; is an entity.
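
As a quick check of the three forms, the sketch below uses `html.unescape` from the Python 3 standard library (3.4 and later). This is an assumption about the reader's environment, since the question predates it and uses Python 2.

```python
import html

# Decimal reference, hexadecimal reference, and named entity.
forms = ["&#160;", "&#xa0;", "&nbsp;"]
decoded = [html.unescape(form) for form in forms]

# All three decode to the same single character, U+00A0.
print([hex(ord(ch)) for ch in decoded])

# The headline from the question decodes the same way.
headline = html.unescape(
    "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
)
print(ascii(headline))
```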

Further reading: http://htmlhelp.com/reference/html40/entities/

Here you will find code for Python 2.x that handles all three in one scan through the input:

http://effbot.org/zone/re-sub.htm#unescape-html
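
For illustration, here is a Python 3 sketch of the same single-scan idea. It is not the code from the linked page; the names `_REFERENCE`, `_replace`, and `unescape` are chosen here, and unlike the regex discussed earlier it requires the terminating `;`.

```python
import re
from html.entities import name2codepoint

# One pattern covers hex references, decimal references, and named entities.
_REFERENCE = re.compile(r"&(#[xX][0-9a-fA-F]+|#\d+|\w+);")


def _replace(match):
    body = match.group(1)
    try:
        if body[:2].lower() == "#x":
            return chr(int(body[2:], 16))   # hexadecimal, e.g. &#xa0;
        if body.startswith("#"):
            return chr(int(body[1:]))       # decimal, e.g. &#160;
        return chr(name2codepoint[body])    # named entity, e.g. &nbsp;
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: keep the text as-is.
        return match.group(0)


def unescape(text):
    """Decode all three reference styles in a single pass."""
    return _REFERENCE.sub(_replace, text)


print(ascii(unescape("Adviser&#8217;s &#160;&#xa0;&nbsp; &bogus;")))
```

Returning the original match on failure means malformed or unknown references pass through unchanged rather than raising.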

John Machin