ansaurus

Question

Answer 1

+1 A:

The web page may be lying about its encoding. The output looks like UTF-8 that was decoded as something else, most likely Latin-1. If you ended up with a str, you'll need to decode it as UTF-8. If you have a unicode instead, you'll need to encode it as Latin-1 first and then decode the result as UTF-8.
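A minimal sketch of those two cases, written for Python 3, where the old str corresponds to bytes and the old unicode corresponds to str. The sample title is the one from this question; the variable names are illustrative only.

```python
# The title as it should read, and the UTF-8 bytes a server would send for it.
correct = "The Children of H\u00farin"
raw_bytes = correct.encode("utf-8")

# Case 1: you still hold the raw bytes, so decoding as UTF-8 is enough.
from_bytes = raw_bytes.decode("utf-8")

# Case 2: the bytes were already decoded as Latin-1, which yields mojibake.
mojibake = raw_bytes.decode("latin-1")

# Encoding back to Latin-1 recovers the original bytes; then decode as UTF-8.
from_text = mojibake.encode("latin-1").decode("utf-8")
```

The Latin-1 round trip works because Latin-1 maps every byte value to exactly one code point, so encoding the mojibake text gives back the original bytes unchanged.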

Ignacio Vazquez-Abrams 2009-03-09 22:53:49

Answer 2

+1 A:

In the source of the web page it looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side, before it even gets converted to XML.

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
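If the workaround has to run over every title, it may be safer to wrap the round trip so that titles which are not broken pass through untouched. The following is a hedged Python 3 sketch; fix_mojibake is a hypothetical helper name, not part of BeautifulSoup or the standard library, and returning the input unchanged on failure is an assumption about the desired behaviour.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Returns the text unchanged if the round trip is not possible.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8, so it was probably not mis-decoded this way.
        return text
```

Pure ASCII titles come back unchanged, since ASCII encodes identically in Latin-1 and UTF-8. A correctly decoded title with accented characters normally fails the UTF-8 decode and is returned as-is, although a rare Latin-1 string could happen to form valid UTF-8 and be altered.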

sth 2009-03-09 23:05:28

Yup, I guess that's it. I've contacted LibraryThing about sorting it out. Thanks. :)

Daniel Watkins 2009-03-09 23:21:06


Decoding HTML Entities With Python
