I am fetching a webpage (http://autoweek.com) and trying to process it, but I am getting an encoding error. Autoweek declares the "iso-8859-1" encoding and contains the word "Nürburgring" (u with umlaut).

I do:

# -*- encoding: utf-8 -*-
import urllib
webpage = urllib.urlopen(feed.crawl_url).read()
webpage.decode("utf-8")

It gives me the following error:

'utf8' codec can't decode bytes in position 7768-7773: unsupported Unicode code range

If I bypass the .decode step and do some parsing with the lxml library, it raises an error when I save the parsed title to the database:

'utf8' codec can't decode bytes in position 45-50: unsupported Unicode code range

My database has character set utf8 and collation utf8_general_ci.

My settings:
Django
Python 2.4.3
MySQL 5.0.22
MySQL-python 1.2.1
mod_python 3.2.8

+3  A: 

If the webpage declares encoding iso-8859-1, can't you just do webpage.decode("iso-8859-1")?

At that point, the result of the decode call is a unicode string your app can work with. When it is written to the database, the mapping there should handle the encoding to UTF-8.
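
A minimal sketch of that round trip, assuming the bytes really are ISO-8859-1 as the server declares. The byte string is a hand-made stand-in for the fetched page rather than real autoweek.com content, and the snippet is written for a current Python (the b"..." literal does not exist in Python 2.4):

```python
# Stand-in for the bytes returned by urlopen(...).read(). In ISO-8859-1
# the letter u-umlaut is the single byte 0xFC.
raw = b"N\xfcrburgring"

# Decode with the encoding the server declared to get a unicode string.
text = raw.decode("iso-8859-1")

# Re-encode as UTF-8 only at the boundary, e.g. when writing to the database.
utf8_bytes = text.encode("utf-8")
```

Decoding the same bytes as UTF-8 fails because 0xFC is not a valid UTF-8 byte for this character, which is consistent with the error in the question.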

To get the correct encoding, you have two options. Either tell the web server that you only accept, say, UTF-8 (for example with an Accept-Charset request header), and then that's what you'll hopefully always get, since just about everyone reads UTF-8 (or you could try it with ISO-8859-1). Or call .info() on the response and inspect the encoding name of the stream returned.

See urllib2 - The Missing Manual and Quick reference to HTTP headers for details.
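
As a rough sketch of the second option, the charset can be pulled out of a Content-Type header value with the standard email package. The helper name charset_from_content_type and its default argument are made up for this example, and the header string is typed in by hand instead of being read from a live response:

```python
from email.message import Message

def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: `default` is an assumed fallback for headers
    that carry no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default

header_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
```

In Python 3 the response object exposes the same method directly as response.headers.get_content_charset(). On Python 2.4 the email module layout differs, so treat this as an illustration of the idea and not as drop-in code.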

lavinio
I need to generalize this solution to all the pages (of different encodings) I am fetching. So I have to fetch, extract the encoding (if it's declared) and then decode. Any easier solution?
Yury Lifshits
No. It's the only solution, unless you want to throw away the incorrect characters. And it's honestly not very complicated.
Lennart Regebro
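
For reference, "throwing away the incorrect characters" would look roughly like this, using the standard errors argument of decode. The sample bytes are made up for illustration, and the behaviour shown is that of a current Python 3:

```python
raw = b"N\xfcrburgring"  # ISO-8859-1 bytes, not valid UTF-8

# "ignore" silently drops the byte that cannot be decoded.
dropped = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD (the replacement character) for it.
replaced = raw.decode("utf-8", "replace")
```

Either way the umlaut is lost, which is why decoding with the declared encoding is the better route.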
+1 you need to decode using iso-8859-1. I've verified this against your URL and it works fine.
mhawke
A: 

autoweek.com seems confused about its own encoding. It declares conflicting charset definitions:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

and later...

<meta charset=iso-8859-1"/>.

iso-8859-1 is the correct one since this is returned in the header from the web server and by the .info() method (and it actually decodes), but this demonstrates that you can't necessarily rely on the Content-Type declaration in web pages. You should follow the method described by lavinio.
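
A small sketch of why the header should win. The function names, the regular expression, and the sample bytes are all assumptions made for this example. The sample only imitates the two conflicting meta tags quoted above and is not the real page:

```python
import re

# Finds charset=... inside <meta> tags, tolerating missing or stray quotes.
META_CHARSET_RE = re.compile(
    r'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_meta_charsets(html_bytes):
    """List every charset named in a <meta> tag, in document order."""
    # ISO-8859-1 maps every byte to a character, so this scan never fails.
    text = html_bytes.decode("iso-8859-1")
    return [name.lower() for name in META_CHARSET_RE.findall(text)]

def pick_charset(header_charset, html_bytes, default="iso-8859-1"):
    """Prefer the HTTP header, then the first <meta> tag, then a default."""
    if header_charset:
        return header_charset.lower()
    metas = declared_meta_charsets(html_bytes)
    if metas:
        return metas[0]
    return default

# Hand-made sample imitating the conflicting declarations.
sample = (b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
          b'<meta charset=iso-8859-1"/><title>N\xfcrburgring</title>')

conflicting = declared_meta_charsets(sample)
page_text = sample.decode(pick_charset("iso-8859-1", sample))
```

Without the header, this sketch would pick the first meta tag (utf-8) and the decode would fail on this sample, so the fallback order is a judgment call, not a guarantee.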

mhawke