I seem to have the all-too-familiar problem of reading and displaying a web page correctly. It looks like Python reads the page as UTF-8, but when I try to convert it to something more viewable (iso-8859-1) I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)
The code looks like this:
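#!/usr/bin/python
# Python 2 script: prints each table row of the page as one ';'-separated record.
from urllib import urlopen
import re

url_address = 'http://www.eurohockey.net/players/show_player.cgi?serial=4722'

begin_record = 0   # 1 while inside an "even"/"odd" table row
col = 0            # number of cells seen in the current row
record = ''        # renamed from `str`, which shadowed the built-in type

for line in urlopen(url_address):
    if '</tr' in line:
        # End of a row: print the collected record and reset.
        begin_record = 0
        print record
        record = ''
        continue
    if begin_record == 1:
        col = col + 1
        tmp_match = re.search('<td>(.+)</td>', line.strip())
        if tmp_match:  # skip lines without a complete <td>...</td> cell
            # Decode the raw bytes as iso-8859-1 into a unicode object.
            record = record + ';' + unicode(tmp_match.group(1), 'iso-8859-1')
    if '<tr class="even"' in line or '<tr class="odd"' in line:
        begin_record = 1
        col = 0
        continue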
How should I handle the contents? Firefox at least thinks the page is iso-8859-1, and that would make sense given its contents. The error clearly comes from the 'ä' character.
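To narrow it down, here is a minimal sketch that seems to reproduce the same error without the web page. The sample string is made up for illustration (it is not taken from the page); it just puts 'ä' at index 2, as in the traceback. My guess is that printing a unicode object to a terminal or pipe whose encoding is ASCII triggers an implicit encode like this one, but that is an assumption on my part.

```python
# Hypothetical sample value; u'\xe4' is a-umlaut, placed at index 2.
text = u';H\xe4kkinen'

# Encoding to ASCII fails, because a-umlaut has no ASCII representation.
try:
    text.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding to iso-8859-1 succeeds: a-umlaut is the single byte 0xE4 there.
latin1_bytes = text.encode('iso-8859-1')
```

If that guess is right, the decoding with 'iso-8859-1' is fine and the failure happens only when the result is written out.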
And if I were to save this data to a database, should I leave the encoding alone and only convert it when displaying it?
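To make the database question concrete, this is a rough sketch of the alternative I have in mind: decode once right after reading, store the decoded text, and encode only when showing it. It assumes the page bytes really are iso-8859-1; the sample bytes, the `players` table, and the in-memory SQLite database are placeholders, not my real setup.

```python
import sqlite3

# Assumed raw bytes as they might arrive from the page (iso-8859-1 encoded).
raw = b'H\xe4kkinen'

# Decode once, right after reading.
name = raw.decode('iso-8859-1')

# Store the decoded text; an in-memory SQLite table stands in for the real database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE players (name TEXT)')
conn.execute('INSERT INTO players (name) VALUES (?)', (name,))
stored = conn.execute('SELECT name FROM players').fetchone()[0]
conn.close()

# Encode only at the output boundary, in whatever charset the viewer expects.
as_latin1 = stored.encode('iso-8859-1')
as_utf8 = stored.encode('utf-8')
```

Is this the sensible way to do it, or is it better to store the original bytes untouched?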