I seem to have the all-too-familiar problem of correctly reading and displaying a web page. It looks like Python reads the page as UTF-8, but when I try to convert it to something more viewable (iso-8859-1), I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

The code looks like this:

#!/usr/bin/python
from urllib import urlopen
import re

url_address = 'http://www.eurohockey.net/players/show_player.cgi?serial=4722'

finished = 0
begin_record = 0
col = 0
str = ''

for line in urlopen(url_address):
    if '</tr' in line:
        begin_record = 0                   
        print str
        str = ''
        continue

    if begin_record == 1:
        col = col + 1
        tmp_match =  re.search('<td>(.+)</td>', line.strip())
        str = str + ';' + unicode(tmp_match.group(1), 'iso-8859-1')

    if '<tr class=\"even\"' in line or '<tr class=\"odd\"' in line: 
        begin_record = 1
        col = 0
        continue

How should I handle the contents? Firefox at least thinks it's iso-8859-1 and it would make sense looking at the contents of that page. The error comes from the 'ä' character clearly.

And if I were to save that data to a database, should I leave the encoding alone and convert only when displaying it?

+1  A: 

That text is indeed iso-8859-1; I can decode it without a problem, and your code runs without a hitch.

Your error, however, is an ENCODE error, not a decode error, and your code doesn't do any encoding. Possibly you have confused encoding and decoding; it's a common mistake.

You DECODE from Latin1 to Unicode. You ENCODE the other way, from Unicode to Latin1. Remember that Latin1, UTF8, etc. are called "encodings".
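
To make the two directions concrete, here is a minimal sketch. The sample bytes are a hypothetical stand-in for what the page returns, and the snippet is written so that it should run unchanged on Python 2.6+ and 3.3+.

```python
# Hypothetical sample: Latin-1 bytes as they might arrive from the page.
raw = b'J\xe4rvinen'

# DECODE: bytes -> Unicode text.
text = raw.decode('iso-8859-1')

# ENCODE: Unicode text -> bytes, in whichever encoding the destination needs.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('iso-8859-1')
```

The same Unicode text yields different byte sequences depending on the encoding chosen, which is why the encoding has to match whatever consumes the bytes.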

Lennart Regebro
+3  A: 

As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line

print str

is your problem. For that line to work, you need to encode str into whatever encoding your console uses.
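
One way to do that is sketched below. The helper name encode_for_console and the 'utf-8' fallback are assumptions for illustration, not something from your script; pick the fallback that suits your terminal.

```python
import sys

def encode_for_console(text, encoding=None, errors='replace'):
    """Encode Unicode text to bytes for output.

    `encoding` defaults to the console's encoding; 'utf-8' is an assumed
    fallback for when sys.stdout.encoding is None or missing.
    """
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # errors='replace' substitutes '?' for characters the target cannot represent.
    return text.encode(target, errors)
```

In Python 2 you would then write print encode_for_console(str) instead of print str.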

Kathy Van Stone
+2  A: 

It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. That error cannot have come from the line you say it did. When asking a question like this, always give the full traceback and error message.

Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.
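
The failure can be reproduced without a console at all by encoding to 'ascii' directly, which is roughly what print does when no output encoding is known. This is a sketch; the record text is a hypothetical value chosen so that the offending character sits at position 2, as in the error message.

```python
text = u';J\xe4rvinen'  # hypothetical record; u'\xe4' is at index 2

try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # exc.encoding names the codec, exc.start the position of the bad character.
    failure = exc
```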

Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.

Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions: (1) import sys; print sys.stdout.encoding up near the front (2) print repr(str) before print str so that I can see what you've got before it crashes.
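
A sketch of those two diagnostics follows. The helper name diagnose is an assumption for illustration, and the snippet should run on both Python 2 and 3.

```python
import sys

# (1) What encoding does Python think stdout uses? None means "unknown".
print(sys.stdout.encoding)

def diagnose(value):
    """Return the type name and an ASCII-safe repr of value."""
    # Python 3's ascii() behaves like Python 2's repr(); use whichever exists.
    safe_repr = ascii if sys.version_info[0] >= 3 else repr
    return '%s: %s' % (type(value).__name__, safe_repr(value))

# (2) Inspect a value before printing it; the output is pure ASCII.
print(diagnose(u';J\xe4rvinen'))
```

The type name shows whether you are holding Unicode text or an already-encoded byte string, which is exactly what the error message depends on.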

In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.

However if I do

\python26\python hockey.py | more

or

\python26\python hockey.py >hockey.txt

it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:

C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
  File "hockey.py", line 18, in <module>
    print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
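
As an alternative to encoding each string by hand, the output stream can be wrapped once so that everything written to it is encoded on the way out. The sketch below uses codecs.getwriter from the standard library; the helper name make_encoding_writer is hypothetical, and an in-memory buffer stands in for sys.stdout or a file.

```python
import codecs
import io

def make_encoding_writer(byte_stream, encoding='utf-8', errors='replace'):
    """Wrap a byte stream so Unicode written to it is encoded automatically."""
    return codecs.getwriter(encoding)(byte_stream, errors)

# Demonstration: the buffer plays the role of a redirected stdout.
buffer = io.BytesIO()
writer = make_encoding_writer(buffer, 'iso-8859-1')
writer.write(u';J\xe4rvinen\n')
output = buffer.getvalue()
```

In a Python 2 script, assigning sys.stdout = make_encoding_writer(sys.stdout, 'cp850') near the top should let print str work even when the output is piped or redirected; the encoding named there is only an example and has to match the display mechanism you actually use.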

John Machin
"the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use."... and for debugging output, always use repr() on strings.
user9876
for debugging anything, use repr() on whatever the concern is, except on 3.x where repr() has been renamed ascii() -- the new repr() is not quite so useful when discussing encoding/decoding problems across the net ;-(
John Machin
Makis
READ MY LIPS:"""Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do."""
John Machin
MORE: previously at the point where you did "print str", you got a message that was consistent with str being Unicode. Now, you are getting a message that indicates that str has been encoded from Unicode into an 8-bit string BEFORE you "try to encode and print". What else have you changed? It would be better if you showed us the whole script again (edit your question). Also when you are asked to use print repr(), please don't trim the u' or ' off the front of the line; in police jargon, this is called "tampering with the evidence" :-)
John Machin