ansaurus

Question

Latin-1 and the unicode factory in Python

Answer 1

+2 A:

Add this at the beginning of the module:

# coding: latin1

Or decode the string to Unicode yourself.

[Edit]

It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:

>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'

[Edit]

Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html

Bastien Léonard 2009-07-20 20:59:20

I've tried putting the coding at the top of my scripts, but that still doesn't work. I'll try the explicit decoding, but I hope there's a more general solution.

eksortso 2009-07-20 21:06:48

You probably don't want to set coding: latin1. That changes the encoding of the script's source, not its data.

Glenn Maynard 2009-07-20 21:13:51

@Glenn: I suggested that because I thought that `print t` may print Latin1 raw strings.

Bastien Léonard 2009-07-20 21:16:14

My script doesn't have any string literals with non-ASCII characters. So that shouldn't be a factor.

eksortso 2009-07-20 21:25:27

@Bastien Léonard: The data that I get from the database is not Unicode. The character in question is 0xed.I can't decode 0xed with the ascii codec. Is there a way to change the default coding just for this one function, so that `unicode` would work?

eksortso 2009-07-20 21:50:20

@Bastien Léonard: I got a combination that finally worked. I'll tack it to the end of my question. Thanks for the help.

eksortso 2009-07-20 21:56:19

Answer 2

+1 A:

Maybe try to decode the latin1-encoded strings into unicode?

t.add_row((value.decode('latin1') for value in rec))

liori 2009-07-20 21:09:36

t.add_row([s.decode('latin-1') if isinstance(s, str) else s for s in rec]) # I think you meant this (or something like it).

eksortso 2009-07-20 21:54:48

Probably, depending on what that prettytable thingy need.

liori 2009-07-20 23:30:19

Answer 3

A:

After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.

Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:

>>> s = u'\xed\x31\x32\x33'
>>> print s
# FAILS: Python calls "s.decode('ascii')", but ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: Same as above
>>> print s.decode('latin-1')
í123

Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO -- you're more likely to know what charset your data is using, than the author of PrettyTable :). ~~BTW, you can omit the check for strings in your list comprehension by replacing s.decode('latin-1') with unicode(s, 'latin-1') since all objects can be coerced to strings~~.

One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.

elo80ka 2009-07-20 23:45:32

If it's using Unicode objects internally, there should be a way to get Unicode objects back and avoid pointlessly converting them back and forth. As long as you always use Unicode objects you avoid most of this mess (that's how Python 3 always works).

Glenn Maynard 2009-07-21 02:45:47

I believe it does: "get_string".

elo80ka 2009-07-21 09:19:26

@elo80ka, the data really is stored as Latin-1. I verified that before writing the script. Also (in Python 2.6 at least), ints cannot be coerced using `unicode(int_value, 'latin-1')`, even though `unicode(int_value)` works.@Glenn Maynard, printing the results involves a decode, explicitly defined or not. I had to use `t.get_string().encode('latin-1')`Yeah, I'm looking forward to py3k's widespread adoption, so that all strings are Unicode. It would save a lot of hassle.

eksortso 2009-07-21 18:42:30

@eksortso: you're right...character sets don't really make sense for numbers anyway.

elo80ka 2009-07-21 20:42:27

ansaurus

tags:

views:

answers:

Latin-1 and the unicode factory in Python

related questions