Here's my problem:

import urllib2

response=urllib2.urlopen('http://proxy-heaven.blogspot.com/')
html=response.read()

print html

It happens only with this site, and I don't know why the result is all garbled characters. Can anyone help?

+1  A: 

Without your output it's hard to say, but I'd bet it's an encoding issue: this website is encoded in UTF-8. If your terminal is set to ISO Latin-1, for example, it won't be able to display the characters properly.

Guillaume Lebourgeois
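To illustrate the kind of mismatch described in this answer, here is a small sketch. The sample string and the function name show_mismatch are made up for illustration; the actual page and terminal may behave differently. It encodes text as UTF-8 and then reads the bytes back as Latin-1, which is roughly what a Latin-1 terminal does with UTF-8 output:

```python
def show_mismatch(text, page_encoding='utf-8', terminal_encoding='latin-1'):
    """Encode text as the page would, then decode it as the terminal would."""
    raw = text.encode(page_encoding)
    return raw.decode(terminal_encoding)

# u'caf\u00e9' is "cafe" with an acute accent on the final letter.
garbled = show_mismatch(u'caf\u00e9')
print(repr(garbled))
```

Plain ASCII text survives the round trip unchanged, which may be why mostly-ASCII pages look fine while others come out garbled.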
I've tried several sites with content="text/html; charset=utf-8", which means they are UTF-8 encoded, right? The weird thing is that all of these sites work fine except this particular one: http://proxy-heaven.blogspot.com/. I really want to know why...
Shane
No, it only means the developer declared it as such; the page could in fact be in any other encoding. The only way to be sure of the encoding is to detect it.
Guillaume Lebourgeois
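One rough way to check a declared charset against the actual bytes is sketched below. This is only a heuristic under stated assumptions: guess_encoding is a hypothetical helper, the regular expression only catches simple charset=... declarations, and the Latin-1 fallback is a guess rather than real detection (a statistical detector such as the chardet library would be more thorough).

```python
import re

def guess_encoding(raw, declared=None):
    """Return the first candidate encoding that decodes `raw` without error."""
    candidates = []
    if declared:
        candidates.append(declared)
    # Look for a charset declaration near the top of the markup.
    match = re.search(br'charset=["\']?([\w-]+)', raw[:2048])
    if match:
        candidates.append(match.group(1).decode('ascii'))
    candidates.append('utf-8')
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so decoding never fails.
    return 'latin-1'

# A page that claims UTF-8 but actually contains a Latin-1 byte (0xE9).
lying_page = b'<meta charset="utf-8"><p>caf\xe9</p>'
print(guess_encoding(lying_page))
```

With urllib2, raw would be the result of response.read(); the declared value could come from the Content-Type response header if the server sends one.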
Well, I'm just starting with Python, so could you give some examples of how to decode/detect it? And by the way, everything is fine in my browser using UTF-8 character encoding; does that mean the page is UTF-8 encoded?
Shane
Sure, there is a great library for that: BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/). I recommend setting your whole environment to UTF-8 (your Python files, your terminal, ...).
Guillaume Lebourgeois
Would you mind giving me a few lines of code showing how to make it work? I've tried BeautifulSoup and html.decode('utf-8') but still cannot get it to work.
Shane
If your terminal is set to UTF-8, you will want to print html.encode('utf8'), not decode.
Guillaume Lebourgeois
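The decode/encode distinction discussed in these comments can be sketched as a round trip: decode the downloaded bytes into text using the page's encoding, then encode that text for whatever the terminal expects. The helper name for_terminal and the sample bytes are assumptions made for illustration, and errors='replace' is one possible policy for characters the terminal cannot show.

```python
def for_terminal(raw, source_encoding, terminal_encoding):
    """Re-encode downloaded bytes for a terminal with a different encoding."""
    text = raw.decode(source_encoding)  # bytes -> text
    # Characters the terminal encoding lacks become '?' instead of raising.
    return text.encode(terminal_encoding, 'replace')  # text -> bytes

# Stand-in for response.read(): UTF-8 bytes for "cafe" with an accented e.
page_bytes = u'caf\u00e9'.encode('utf-8')
print(repr(for_terminal(page_bytes, 'utf-8', 'latin-1')))
print(repr(for_terminal(page_bytes, 'utf-8', 'ascii')))
```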
+1  A: 

Works for me:

import urllib
response=urllib.urlopen('http://proxy-heaven.blogspot.com/')
a = response.read()
print a[:50]

> '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Stric'

You may have an encoding problem in your terminal, though.

loevborg
What can I do to fix it? Weird... I didn't do anything with my python 2.6.5, how come it's not working?
Shane
Well that obviously depends on the platform you're working on, the method of installing python, etc. Does the problem appear in an interactive python session such as the one I pasted above?
loevborg
Note, also, that the choice of urllib and urllib2 may make a difference (in your example both were mentioned.)
loevborg
I just uninstalled 2.6.5 and installed 2.7, and the problem remains. Yes, it appears in an interactive Python session; I'm using the default Python shell under XP.
Shane
That reminds me: it should be urllib2 throughout.
Shane
Well, the problem has now disappeared and everything works fine. A minute ago the same four-line code gave me garbled characters; now everything is OK. Weird... It bugged me for several hours for no apparent reason.
Shane
Glad it worked for you. I suspect it had more to do with the Windows command shell or the Windows variant of Python than with the Python code itself.
loevborg
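If the shell is the suspect, one quick check is to ask Python which encoding it believes the output stream uses. This is a minimal sketch; output_encoding is a hypothetical helper name, and the value it reports depends entirely on the platform and shell (a Windows console often reports a legacy code page rather than UTF-8).

```python
import locale
import sys

def output_encoding(stream=None):
    """Return the encoding the given stream (default: sys.stdout) reports."""
    if stream is None:
        stream = sys.stdout
    # Redirected or replaced streams may report no encoding; use the locale's.
    return getattr(stream, 'encoding', None) or locale.getpreferredencoding()

print(output_encoding())
```

If the reported encoding differs from the page's encoding, printing the raw bytes unchanged is likely to produce garbled characters.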
A: 

Encoding may be your problem, in which case you want to decode the bytes you read:

import urllib
# Python 2: read() returns raw bytes; decode them into a unicode string.
s = urllib.urlopen('http://proxy-heaven.blogspot.com/').read().decode('utf8')
Zonda333