The following Python code ...

import urllib2
import codecs

html_data = urllib2.urlopen(some_url).read()
f = codecs.open(filename, 'w', encoding='utf-8')
f.write(html_data)
f.close()

... sometimes fails with UnicodeDecodeError ...

File "/.../lib/python2.6/codecs.py", line 686, in write
  return self.writer.write(data)
File "/.../lib/python2.6/codecs.py", line 351, in write
  data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5605: ordinal not in range(128)

My questions:

  • How do I make sure my urllib2.urlopen(some_url).read() call always returns UTF-8?
  • Is there anything wrong with my codecs.open(...) call that prevents it from storing my data to disk in UTF-8 encoding?
+1  A: 
  1. AFAIK, you cannot do that: urllib2 returns whatever bytes the server sends. However, you can detect the encoding from the headers / HTML and re-encode.
  2. I don't know. I have always used binary mode for writing, and it has always worked.

Example:

import urllib2

# `encoding` is whatever you detected from the headers / HTML meta
data = urllib2.urlopen(uri).read().decode(encoding)
f = open(file_name, 'wb')
f.write(data.encode('utf-8'))
f.close()
Almad
+1  A: 

The problem is not with codecs.open -- it's with passing .write a byte string that (given the \xd0 byte in it) is clearly encoded in some ISO-8859-* or related codec, not ASCII.

urllib2.urlopen returns a response object which, besides file-like behavior, has the extra method:

info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

In particular the Content-Type header, for text-like contents, should have a charset parameter specifying the encoding it uses, e.g. Content-Type: text/html; charset=ISO-8859-4. You need to parse and isolate the charset and use it to decode the contents into Unicode (so your codecs.opened file-like object always gets unicode arguments to write and properly writes them out in utf-8).
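I believe in Python 2 you can get the parameter directly from the response, e.g. response.info().getparam('charset'), but extracting it by hand is easy to sketch. The helper name below is hypothetical, and the ISO-8859-1 default is an assumption (it is the traditional HTML default, as the comments below note):

```python
def charset_from_content_type(content_type, default='ISO-8859-1'):
    """Pull the charset parameter out of a Content-Type header value."""
    # Parameters follow the media type, separated by semicolons,
    # e.g. "text/html; charset=ISO-8859-4".
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"\' ') or default
    return default

print(charset_from_content_type('text/html; charset=ISO-8859-4'))  # ISO-8859-4
print(charset_from_content_type('text/html'))                      # ISO-8859-1
```

Once you have the charset, decode the raw bytes with it before handing the result to your codecs-opened file.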

If charset is missing, or using it to decode the text results in errors (suggesting charset is wrong), as the last hope of salvation you can try the Universal Encoding Detector which uses heuristics for the purpose (after all, many pages on the web have horrible metadata errors, as well as broken HTML and so forth).
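If you would rather not depend on an external detector, a minimal stdlib-only fallback chain can be sketched like this (the helper and its fallback order are assumptions, not part of the answer; ISO-8859-1 is last because it accepts any byte sequence):

```python
def decode_html(raw, declared_charset=None):
    """Decode raw HTML bytes, trying the declared charset first.

    Falls back to UTF-8, then ISO-8859-1; the latter maps every
    byte to a code point, so some unicode string is always returned.
    """
    for enc in (declared_charset, 'utf-8', 'iso-8859-1'):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset -- try the next candidate.
            continue
```

This gives up accuracy compared to a heuristic detector, but it guarantees you always end up with unicode to write out.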

Alex Martelli
Technically, the default charset for HTML is "iso-8859-1", so if there is no charset declared, it should be iso-8859-1. Of course, HTML is a wild and woolly world, so there's no guarantee that a document served with no charset is actually in iso-8859-1.
Ned Batchelder
@Ned, good point, but I'd _still_ try the UED just in case (of course it can't distinguish between the various ISO-8859-etc classes or cognates like CP1252, but at least it's a "plan B" for safety;-).
Alex Martelli