I have this code in Google App Engine (Python SDK):

from string import maketrans 

intab =  u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode('latin1') 
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode('latin1') 
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)

When I run the code in the interactive console I have no problem, but when I try it in GAE I get the following error:

raise ValueError, "maketrans arguments must have same length"
ValueError: maketrans arguments must have same length
INFO 2009-12-03 20:04:02,904 dev_appserver.py:3038] "POST /backendsavenew HTTP/1.1" 500 -
INFO 2009-12-03 20:08:37,649 admin.py:112] 106
INFO 2009-12-03 20:08:37,651 admin.py:113] 53
ERROR 2009-12-03 20:08:37,653 init.py:388] maketrans arguments must have same length

I can't figure out why intab is doubled in size. The Python file with the code is saved as UTF-8.

Thanks in advance for any help.

+6  A: 

string.maketrans and string.translate do not work for Unicode strings; they operate on byte strings. When intab ends up as UTF-8 bytes, each accented character such as å takes two bytes, while ASCII a takes one. string.maketrans compares the byte lengths, which differ for your two strings (106 versus 53 in your log).
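The length mismatch can be reproduced outside GAE. The following is a minimal sketch, assuming the doubling comes from the file's UTF-8 bytes being read as if they were Latin-1; the variable names `latin1_bytes`, `utf8_bytes`, and `misread` are only illustrative.

```python
# -*- coding: utf-8 -*-
intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"

# One byte per character in Latin-1, two bytes per character in UTF-8.
latin1_bytes = len(intab.encode("latin-1"))
utf8_bytes = len(intab.encode("utf-8"))

# Assumed failure mode: the UTF-8 bytes of the source file are decoded
# as Latin-1, so every accented character turns into two characters.
misread = intab.encode("utf-8").decode("latin-1")

print(len(outtab), latin1_bytes, utf8_bytes, len(misread))
```

With a correct encoding declaration both tables have 53 entries; the misread string has 106, matching the numbers in the log.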

There is a Unicode translate, but for your use case (converting Unicode to ASCII because some part of your system cannot deal with Unicode) you should use Unidecode: http://pypi.python.org/pypi/Unidecode. Unidecode is very smart about transliterating Unicode characters to sensible ASCII, and it covers many more characters than your example does.
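If installing an extra package is not an option, a rough standard-library approximation is to decompose each character and drop the combining marks. This is not what Unidecode does internally and it covers far fewer characters; `strip_accents` is a hypothetical helper name. Note that it preserves case and leaves characters without a decomposition, such as Ø, untouched.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFKD decomposition (stdlib only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = strip_accents(u"ÀÉîõüÿñç")
untouched = strip_accents(u"Øø")  # no decomposition, so nothing changes
```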

You should save your Python code as UTF-8, but make sure you add the encoding declaration (the "magic comment") so Python doesn't have to guess or fall back to a default encoding. This line should be the first or second line of your Python files:

# -*- coding: utf-8 -*-

There are many advantages to processing text as Unicode instead of binary strings. This is the Unicode way to do what you are trying to do:

intab =  u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
translated = intab.translate(trantab)
translated == outtab # True
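The same dictionary can also delete characters: the Unicode translate method removes any character whose code point maps to None. A small sketch follows, using a shortened table and a made-up sample string for illustration.

```python
# -*- coding: utf-8 -*-
intab = u"àéñ"
outtab = u"aen"

# Map each accented code point to its ASCII replacement.
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))

# Mapping a code point to None deletes that character.
for ch in u"!?":
    trantab[ord(ch)] = None

cleaned = u"señor, àé!?".translate(trantab)
```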

See also http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database

See also http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings

joeforker
I'd rather not add a new package to GAE to solve it, but I'll look into the code of Unidecode. Thanks.
Chedar
# -*- coding: utf-8 -*- solved it. Thanks.
Chedar
I also needed to delete some characters. I changed the code to "the Unicode way" and added some conversions to None in the dictionary.
Chedar
+1  A: 

Maybe you could use the iso-8859-1 encoding for your file instead of utf-8:

# -*- coding: iso-8859-1 -*-
from string import maketrans 
import logging

intab =  "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)

Remember to select iso-8859-1 in your text editor when saving this Python source file.
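The byte-string approach works because Latin-1 uses exactly one byte per character in this table. As a sketch that does not depend on how the source file is saved, the tables can be encoded explicitly; the fallback to `bytes.maketrans` is an addition so the snippet also runs on Python 3, and the sample word is made up.

```python
# -*- coding: utf-8 -*-
try:
    from string import maketrans  # Python 2
except ImportError:
    maketrans = bytes.maketrans  # Python 3 equivalent for byte strings

intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode("latin-1")
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode("latin-1")

# Both tables are 53 bytes long, so maketrans accepts them.
trantab = maketrans(intab, outtab)

# The input must be Latin-1 encoded bytes as well.
converted = u"Ñandú".encode("latin-1").translate(trantab)
```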

Kamil Szot
I wonder why I got downvoted. The code runs on Google App Engine (I just tested) and does what it is supposed to.
Kamil Szot
Unfortunately the question is wrong. The right solution is to do text processing in Unicode and avoid `str.translate` altogether.
joeforker