The following Python code ...

import urllib2
import codecs

html_data = urllib2.urlopen(some_url).read()
f = codecs.open(filename, 'w', encoding='utf-8')
f.write(html_data)
f.close()

... sometimes fails with UnicodeDecodeError ...

File "/.../lib/python2.6/codecs.py", line 686, in write
  return self.writer.write(data)
File "/.../lib/python2.6/codecs.py", line 351, in write
  data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 5605: ordinal not in range(128)

My questions:

  • How do I make sure my urllib2.urlopen(some_url).read() call always returns UTF-8?
  • Is there anything wrong with my codecs.open(...) call that prevents it from storing my data to disk in UTF-8 encoding?
+1  A: 
  1. AFAIK, you cannot do that: urllib2 returns whatever bytes the server sends. However, you can detect the encoding from the headers / HTML and re-encode.
  2. I don't know. I have always used binary mode for writing, and it has always worked.

Example:

import urllib2

# `encoding` is whatever you detected from the headers / HTML meta
data = urllib2.urlopen(uri).read().decode(encoding)
f = open(file_name, 'wb')
f.write(data.encode('utf-8'))
f.close()
Almad
+1  A: 

The problem is not with codecs.open -- it's with passing .write a byte string that (given the \xd0 byte in it) is clearly encoded in some ISO-8859-* or related codec, not ASCII.

urllib2.urlopen returns a response object which, besides file-like behavior, has the extra method:

info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

In particular the Content-Type header, for text-like contents, should have a charset parameter specifying the encoding it uses, e.g. Content-Type: text/html; charset=ISO-8859-4. You need to parse and isolate the charset and use it to decode the contents into Unicode (so your codecs.opened file-like object always gets unicode arguments to write and properly writes them out in utf-8).
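I believe in Python 2 you can get the parameter directly from the response, e.g. response.info().getparam('charset'), but extracting it by hand is easy to sketch. The helper name below is hypothetical, and the ISO-8859-1 default is an assumption (it is the traditional HTML default, as the comments below note):

```python
def charset_from_content_type(content_type, default='ISO-8859-1'):
    """Pull the charset parameter out of a Content-Type header value."""
    # Parameters follow the media type, separated by semicolons,
    # e.g. "text/html; charset=ISO-8859-4".
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"\' ') or default
    return default

print(charset_from_content_type('text/html; charset=ISO-8859-4'))  # ISO-8859-4
print(charset_from_content_type('text/html'))                      # ISO-8859-1
```

Once you have the charset, decode the raw bytes with it before handing the result to your codecs-opened file.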

If charset is missing, or using it to decode the text results in errors (suggesting charset is wrong), as the last hope of salvation you can try the Universal Encoding Detector which uses heuristics for the purpose (after all, many pages on the web have horrible metadata errors, as well as broken HTML and so forth).
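If you would rather not depend on an external detector, a minimal stdlib-only fallback chain can be sketched like this (the helper and its fallback order are assumptions, not part of the answer; ISO-8859-1 is last because it accepts any byte sequence):

```python
def decode_html(raw, declared_charset=None):
    """Decode raw HTML bytes, trying the declared charset first.

    Falls back to UTF-8, then ISO-8859-1; the latter maps every
    byte to a code point, so some unicode string is always returned.
    """
    for enc in (declared_charset, 'utf-8', 'iso-8859-1'):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset -- try the next candidate.
            continue
```

This gives up accuracy compared to a heuristic detector, but it guarantees you always end up with unicode to write out.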

Alex Martelli
Technically, the default charset for HTML is "iso-8859-1", so if there is no charset declared, it should be iso-8859-1. Of course, HTML is a wild and woolly world, so there's no guarantee that a document served with no charset is actually in iso-8859-1.
Ned Batchelder
@Ned, good point, but I'd _still_ try the UED just in case (of course it can't distinguish between the various ISO-8859-etc classes or cognates like CP1252, but at least it's a "plan B" for safety;-).
Alex Martelli