I would expect the output of getencoding() in the following Python session to be "ISO-8859-1":

>>> import urllib2
>>> response = urllib2.urlopen("http://www.google.com/")
>>> response.info().plist
['charset=ISO-8859-1']
>>> response.info().getencoding()
'7bit'

This is with Python 2.6 (specifically '2.6 (r26:66714, Aug 17 2009, 16:01:07) \n[GCC 4.0.1 (Apple Inc. build 5484)]').

A: 

Well, what is it you think is broken?

I get ISO-8859-2 with both urllib2 and wget (I'm currently in Poland), and UTF-8 with Firefox. This is because my Firefox tells the site that it accepts ISO-8859-1 and UTF-8, while wget and urllib2 do not send an Accept-Charset header at all. The relevant request header is:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Remove UTF-8 from that header and you won't get UTF-8 back. This is easy to test by telnetting to port 80 and typing the request by hand.
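If you would rather test this from Python than from telnet, a minimal sketch follows. It only builds the request and does not send it; build_request is a hypothetical helper name, and whether a given server honours Accept-Charset is up to that server.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2, as used in the question


def build_request(url, charsets):
    """Hypothetical helper: build a request that sends an Accept-Charset header."""
    return Request(url, headers={"Accept-Charset": ",".join(charsets)})


req = build_request("http://www.google.com/",
                    ["ISO-8859-1", "utf-8;q=0.7", "*;q=0.7"])

# Request stores header names via str.capitalize(), so the lookup key
# is "Accept-charset" rather than "Accept-Charset".
print(req.get_header("Accept-charset"))
```

Passing req to urlopen() would then send the same header Firefox sends; dropping the utf-8 entry from the list is the equivalent of the telnet experiment.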

Google.com simply (and reasonably) defaults to ISO-8859-1 and google.pl to ISO-8859-2, and I'm sure there are other defaults for other sites.

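I get no encoding header with wget, urllib2, or telnet. getencoding() does not look at the charset at all: it reports the Content-Transfer-Encoding header and falls back to '7bit' when that header is missing. That is a bit nonsensical for HTTP, where the encoding header you would normally see is Content-Encoding, and that is typically either gzip or absent.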
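The charset the question is after is a parameter of the Content-Type header. In Python 2, response.info().getparam('charset') should return it directly. As a version-independent sketch, the standard email package can parse the header value; charset_from_content_type is a hypothetical helper name, and the header strings below are made-up examples rather than captured responses.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Hypothetical helper: return the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The first call prints the lower-cased charset name and the second prints None, because that header value has no charset parameter.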

Lennart Regebro