ansaurus

Question

Unicode problems with web pages in Python's urllib

Answer 1

+1 A:

That text is indeed ISO-8859-1; I can decode it without a problem, and your code runs without a hitch.

Your error, however, is an ENCODE error, not a decode error, and your code doesn't do any encoding. Possibly you have confused encoding with decoding; it's a common mistake.

You DECODE from Latin-1 to Unicode. You ENCODE the other way. Remember that Latin-1, UTF-8, etc. are called "encodings".
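A minimal sketch of the two directions, assuming a made-up byte string (the sample name is not from the question). The `b''` and `u''` literals are accepted by Python 2.6+ and 3.3+:

```python
# Hypothetical sample: the bytes a Latin-1 page would send for "J", a-diaeresis, "rvinen".
raw = b"J\xe4rvinen"

# DECODE: bytes in a known encoding -> Unicode text.
text = raw.decode("latin-1")

# ENCODE: Unicode text -> bytes in whatever encoding the destination wants.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
```

Encoding back to Latin-1 returns the original bytes, while the UTF-8 form uses two bytes for the same character.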

Lennart Regebro 2009-06-29 14:11:18

Answer 2

+3 A:

As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line

print str

is your problem. You need to encode str into whatever encoding your console uses for that line to work.
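One way this could look, as a sketch only: `encode_for_stream` is a hypothetical helper, and the UTF-8 fallback is an assumption rather than something the question specifies.

```python
import sys


def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode Unicode text for a stream, using a fallback if no encoding is set.

    Hypothetical helper, not part of the original script.
    """
    if stream is None:
        stream = sys.stdout
    # The encoding attribute can be None or missing (for example when output
    # is piped or redirected on Python 2), so choose one explicitly.
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes '?' for characters the target cannot represent.
    return text.encode(encoding, "replace")
```

On Python 2 the result can be passed straight to the print statement. On Python 3, print() encodes text itself, so the returned bytes would instead be written to sys.stdout.buffer.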

Kathy Van Stone 2009-06-29 14:21:03

Answer 3

+2 A:

It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. It is impossible for that error to have arisen from the line you point to. When asking a question like this, always give the full traceback and error message.

Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.

Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.

Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions: (1) import sys; print sys.stdout.encoding up near the front (2) print repr(str) before print str so that I can see what you've got before it crashes.
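Those two diagnostics could be bundled roughly as follows. This is a sketch: `diagnose` is a hypothetical name, and the sample value is made up.

```python
import sys


def diagnose(value, stream=None):
    """Report the stream's encoding and an unambiguous repr of the value."""
    if stream is None:
        stream = sys.stdout
    # None here means the stream has no declared encoding.
    encoding = getattr(stream, "encoding", None)
    return "stdout encoding: %r, value: %r" % (encoding, value)


# Made-up sample value; in the script this would be the string about to be printed.
print(diagnose(b"J\xe4rvinen"))
```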

In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.

However if I do

\python26\python hockey.py | more

or

\python26\python hockey.py >hockey.txt

it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:

C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
  File "hockey.py", line 18, in <module>
    print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
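The failure can be reproduced without any redirection by encoding to ASCII directly. The sketch below uses a made-up string with the a-with-diaeresis at index 2, and UTF-8 as an assumed target encoding:

```python
# Made-up sample with U+00E4 at index 2, matching "position 2" in the traceback.
text = u"Is\xe4"

try:
    # Roughly what happens implicitly when the output encoding defaults to ASCII.
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Explicitly choosing an encoding that can represent the character succeeds.
encoded = text.encode("utf-8")
```

Here `error` holds the UnicodeEncodeError, and its `encoding`, `start`, and `reason` attributes carry the same details as the traceback message.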

John Machin 2009-06-29 15:42:05

"the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use."... and for debugging output, always use repr() on strings.

user9876 2009-06-29 15:47:47

for debugging anything, use repr() on whatever the concern is, except on 3.x where repr() has been renamed ascii() -- the new repr() is not quite so useful when discussing encoding/decoding problems across the net ;-(
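A short Python 3 illustration of that difference, using a made-up sample string (ascii() does not exist on Python 2, where repr() already escapes non-ASCII characters):

```python
# Python 3 only.
text = "J\xe4rvinen"  # made-up sample containing U+00E4

escaped = ascii(text)   # non-ASCII characters are shown as escapes
readable = repr(text)   # printable non-ASCII characters are shown as-is
```

The escaped form survives copy and paste through consoles and web forms that may mangle the raw character.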

John Machin 2009-06-29 16:02:36

Makis 2009-06-30 07:43:49

READ MY LIPS:"""Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do."""

John Machin 2009-06-30 08:12:51

MORE: previously at the point where you did "print str", you got a message that was consistent with str being Unicode. Now, you are getting a message that indicates that str has been encoded from Unicode into an 8-bit string BEFORE you "try to encode and print". What else have you changed? It would be better if you showed us the whole script again (edit your question). Also when you are asked to use print repr(), please don't trim the u' or ' off the front of the line; in police jargon, this is called "tampering with the evidence" :-)

John Machin 2009-06-30 15:17:02
