ansaurus

Question

How do I get str.translate to work with Unicode strings?

Answer 1

+5 A:

The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.

I changed your function to build a dict mapping the ordinal of every character you want replaced to its replacement:

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
    return to_translate.translate(translate_table)

>>> translate_non_alphanumerics(u'<foo>!')
u'_foo__'

edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:

>>> translate_non_alphanumerics(u'<foo>!', u'bad')
u'badfoobadbad'
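
To make the three kinds of values concrete, here is a minimal sketch. The table below is purely illustrative and not part of the function above: it maps one character to an ordinal, one to a Unicode string, and one to None. It should also run unchanged on Python 3, where str is the Unicode type.

```python
# Illustrative table: keys are ordinals; each value is either an ordinal,
# a Unicode string, or None.
table = {
    ord(u'<'): ord(u'('),   # ordinal -> ordinal
    ord(u'>'): u')',        # ordinal -> Unicode string
    ord(u'!'): None,        # ordinal -> None deletes the character
}

result = u'<foo>!'.translate(table)
# result == u'(foo)'
```

Characters that have no entry in the table (here f and o) pass through unchanged.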

Mike Boers 2009-08-24 19:02:57

Answer 2

+1 A:

I came up with the following combination of my original function and Mike's version that works with Unicode and ASCII strings:

import string

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    if isinstance(to_translate, unicode):
        translate_table = dict((ord(char), unicode(translate_to))
                               for char in not_letters_or_digits)
    else:
        assert isinstance(to_translate, str)
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to
                                              *len(not_letters_or_digits))
    return to_translate.translate(translate_table)

Update: "coerced" translate_to to unicode for the unicode translate_table. Thanks Mike.

Daryl Spitzer 2009-08-24 19:33:25

I would suggest that you coerce translate_to into Unicode for the Unicode version; otherwise the translate call will fail if you pass it a Unicode string to translate together with a "normal" string as translate_to.

Mike Boers 2009-08-24 19:40:05

This seems like something that should be part of the language. +1

bukzor 2010-04-24 16:39:33

Answer 3

A:

For a simple hack that will work on both str and unicode objects, convert the translation table to unicode before running translate():

import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    translate_table = translate_table.decode("latin-1")
    return to_translate.translate(translate_table)

The catch here is that it implicitly converts every str object to unicode, raising an error if a str passed as to_translate contains non-ASCII characters.
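
As a rough illustration of why a 256-character unicode table works at all, the sketch below builds such a table by hand instead of calling string.maketrans. That substitution is an assumption made so the example also runs on Python 3, where string.maketrans is not available. The name table_256 is hypothetical. translate looks each character up by its ordinal, and a character whose ordinal falls outside the table is left unchanged, because a lookup error maps the character to itself.

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns Unicode characters

not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Hand-built stand-in for string.maketrans(...).decode("latin-1"):
# position i holds the replacement for the character with ordinal i.
table_256 = u''.join(
    u'_' if unichr(i) in not_letters_or_digits else unichr(i)
    for i in range(256)
)

# u'\xe9' (ordinal 233) is inside the table and maps to itself;
# u'\u20ac' (ordinal 8364) is outside the table and is left untouched.
result = u'<caf\xe9>\u20ac!'.translate(table_256)
# result == u'_caf\xe9_\u20ac_'
```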

eswald 2009-08-24 20:07:33
