ansaurus

Question

latin-1 to ascii

Answer 1

A:

Hi,

Check this: http://bradmontgomery.blogspot.com/2009/01/how-to-convert-non-ascii-to-ascii-in.html

yoda 2009-09-05 10:49:03

As you have to do this multiple times (once per character to replace), I doubt it's fast.

Martin v. Löwis 2009-09-05 10:50:28

Answer 2

A:

Without measuring, I would expect that the .translate method of Unicode strings is the fastest solution. You should definitely make your own measurements, though.
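A minimal sketch of what that could look like, assuming Python 3 (where every `str` is a Unicode string) and a small illustrative mapping. The names `TABLE` and `to_ascii` are made up for this example, and the timing call is only there to show how one might measure it:

```python
# Python 3 sketch; TABLE is a tiny illustrative sample, not a complete mapping.
import timeit

# str.translate expects a dict keyed by code point (int).
TABLE = {ord("é"): "e", ord("á"): "a", ord("í"): "i", ord("\u2019"): "'"}

def to_ascii(text):
    """Replace mapped characters; anything not in TABLE passes through unchanged."""
    return text.translate(TABLE)

sample = "Wikipédia, le projet d\u2019encyclopédie"
result = to_ascii(sample)
print(result)

# Measure rather than guess; absolute numbers depend on machine and interpreter.
elapsed = timeit.timeit(lambda: to_ascii(sample), number=10000)
print("%.4f s for 10000 calls" % elapsed)
```

Note that characters missing from the table stay as they are, so the result is only pure ASCII if the table covers everything that occurs in the input.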

Martin v. Löwis 2009-09-05 10:49:24

Answer 3

+1 A:

Maketrans (and translate) then convert to ascii:

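# -*- coding: utf-8 -*-
import unicodedata

intab = u'áéí'   # extend as needed
outtab = u'aei'  # as well as this one
# unicode.translate takes a dict keyed by code point; string.maketrans
# only builds tables for byte strings, so build the dict directly
table = dict(zip(map(ord, intab), outtab))

text = u"Wikipédia, le projet d’encyclopédie".translate(table)

def to_ascii(text, action='ignore'):
    # action is the codec error handler, e.g. 'ignore' or 'xmlcharrefreplace'
    try:
        fixed = unicodedata.normalize('NFKD', text).encode('ASCII', action)
        return fixed
    except Exception, errorInfo:
        print errorInfo
        print "Unable to convert the Unicode characters to xml character entities"
        raise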

(from here)

Tamás Szelei 2009-09-05 10:51:27

But this will convert them to xml character entities. That's not what he asked for.

Lennart Regebro 2009-09-05 11:24:44

I do not understand the first line of the solution, "Maketrans (and translate) then convert to ascii:". Why is that needed, when you do not use it anywhere in the code?

Anurag Uniyal 2009-09-05 11:41:43

@Lennart Regebro: Then it encodes them in ASCII. @Anurag Uniyal: He wanted to replace e.g. 'é' with 'e', which a plain conversion wouldn't do for him. This is why maketrans is needed. The code snippet I copied here only shows the unicode->ASCII conversion.

Tamás Szelei 2009-09-05 12:16:07

I added the maketrans example.

Tamás Szelei 2009-09-05 12:21:17

@sztomi: The thing here is that he wants non-ASCII characters to be translated to ASCII characters, which your example doesn't do. Your maketrans example doesn't even use unicode...

Lennart Regebro 2009-09-05 13:20:05

@Lennart Regebro: Look at this line: `fixed = unicodedata.normalize('NFKD', temp).encode('ASCII', action)`. And maketrans works with unicode as well.

Tamás Szelei 2009-09-05 13:32:24

@sztomi: But wouldn't `unicodedata.normalize('NFKD', temp).encode('ASCII', 'ignore')` be enough? It will also convert é to e.

Anurag Uniyal 2009-09-05 14:12:28

The translate method of Unicode strings doesn't work this way (and doesn't need maketrans): you're confusing it with the same-name method of `str` objects, which DOES work roughly as you say (needing `string.maketrans` to build a table).

Alex Martelli 2009-09-05 15:47:27

Ah, you mean "action" to be "'ignore'". I see. Then it works in fact. Your translate example still doesn't work though. I think you should try your own examples out.

Lennart Regebro 2009-09-05 18:08:16

Answer 4

+6 A:

The "correct" way to do this is to register your own error handler for unicode encoding/decoding, and in that error handler provide the replacements from è to e and ö to o, etc.

Like so:

# -*- coding: UTF-8 -*-
import codecs

map = {u'é': u'e',
       u'’': u"'",
       # ETC
       }

def asciify(error):
    return map[error.object[error.start]], error.end

codecs.register_error('asciify', asciify)

test = u'Wikipédia, le projet d’encyclopédie'
print test.encode('ascii', 'asciify')

You might also find something in IBM's ICU library and its Python bindings, PyICU, though; it might be less work.
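For reference, here is a slightly more defensive variant of the same idea, sketched for Python 3. It is only an illustration: the `REPLACEMENTS` table is a made-up partial mapping, and falling back to `?` for unmapped characters is an assumption of this sketch, not something the handler above does. It replaces the whole range from `error.start` to `error.end` rather than just the first character:

```python
# Python 3 sketch of a custom encode error handler.
import codecs

# Hypothetical, deliberately incomplete mapping.
REPLACEMENTS = {"é": "e", "è": "e", "ö": "o", "\u2019": "'"}

def asciify(error):
    # Only encoding errors carry text that we know how to replace.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The codec may report a run of several unencodable characters at once.
    bad = error.object[error.start:error.end]
    replacement = "".join(REPLACEMENTS.get(ch, "?") for ch in bad)
    # Return the replacement and the position at which encoding resumes.
    return replacement, error.end

codecs.register_error("asciify", asciify)

encoded = "Wikipédia, le projet d\u2019encyclopédie".encode("ascii", "asciify")
print(encoded)
```

Whether an unmapped character should become `?`, be dropped, or raise an error depends on the application; the fallback is a single line to change.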

Lennart Regebro 2009-09-05 11:06:43

+1: I would just add some more checking on the input for the asciify function, but I think this is also a very quick and good reference for custom error handling in unicode encoding.

Roberto Liffredo 2009-09-05 12:17:39

I agree this is the correct implementation. Maybe someone can suggest a full mapping for general-purpose use.

Jason R. Coombs 2009-09-05 13:12:01

+1 for the correct answer, but I think I would choose Alex's answer because it is complete and includes timings.

Anurag Uniyal 2009-09-06 03:53:05

Also, asciify should convert all characters from error.start to error.end.

Anurag Uniyal 2009-09-06 05:30:25

Answer 5

+5 A:

So here are three approaches, more or less as given or suggested in other answers:

# -*- coding: utf-8 -*-
import codecs
import unicodedata

x = u"Wikipédia, le projet d’encyclopédie"

xtd = {ord(u'’'): u"'", ord(u'é'): u'e', }

def asciify(error):
    return xtd[ord(error.object[error.start])], error.end

codecs.register_error('asciify', asciify)

def ae():
  return x.encode('ascii', 'asciify')

def ud():
  return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore')

def tr():
  return x.translate(xtd)

if __name__ == '__main__':
  print 'or:', x
  print 'ae:', ae()
  print 'ud:', ud()
  print 'tr:', tr()

Run as main, this emits:

or: Wikipédia, le projet d’encyclopédie
ae: Wikipedia, le projet d'encyclopedie
ud: Wikipedia, le projet dencyclopedie
tr: Wikipedia, le projet d'encyclopedie

showing clearly that the unicodedata-based approach, while it does have the convenience of not needing a translation map xtd, can't translate all characters properly in an automated fashion: it works for accented letters but not for the typographic apostrophe ’ (U+2019), which it silently drops. So it would also need some auxiliary step to deal explicitly with such characters (no doubt before what's now its body).
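To see why the two characters behave differently, one can inspect what NFKD normalization produces for each. This is a small Python 3 sketch; the helper name `describe` is made up for illustration:

```python
# Python 3 sketch: inspect the NFKD decomposition of a single character.
import unicodedata

def describe(ch):
    """Return the Unicode names of the code points in the NFKD form of ch."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return [unicodedata.name(c) for c in decomposed]

print(describe("é"))
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print(describe("\u2019"))
# ['RIGHT SINGLE QUOTATION MARK']
```

The accented letter splits into an ASCII base letter plus a combining mark, so encoding with 'ignore' drops only the mark. The apostrophe has no decomposition at all, so 'ignore' drops the whole character.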

Performance is also interesting. On my laptop with Mac OS X 10.5 and system Python 2.5, quite repeatably:

$ python -mtimeit -s'import a' 'a.ae()'
100000 loops, best of 3: 7.5 usec per loop
$ python -mtimeit -s'import a' 'a.ud()'
100000 loops, best of 3: 3.66 usec per loop
$ python -mtimeit -s'import a' 'a.tr()'
10000 loops, best of 3: 21.4 usec per loop

translate is surprisingly slow (relative to the other approaches). I believe the issue is that the dict is looked into for every character in the translate case (and most are not there), but only for those few characters that ARE there with the asciify approach.

So for completeness here's a "beefed-up unicodedata" approach:

specstd = {ord(u'’'): u"'", }
def specials(error):
  return specstd.get(ord(error.object[error.start]), u''), error.end
codecs.register_error('specials', specials)

def bu():
  return unicodedata.normalize('NFKD', x).encode('ASCII', 'specials')

this gives the right output, BUT:

$ python -mtimeit -s'import a' 'a.bu()'
100000 loops, best of 3: 10.7 usec per loop

...speed isn't all that good any more. So, if speed matters, it's no doubt worth the trouble of making a complete xtd translation dict and using the asciify approach. When a few extra microseconds per translation are no big deal, one might want to consider the bu approach simply for its convenience (only needs a translation dict for, hopefully few, special characters that don't translate correctly with the underlying unicodedata idea).
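For anyone on Python 3, the bu approach might be sketched roughly as follows. This is an adaptation under assumptions, not the code that was timed above: `SPECIALS` is a hypothetical one-entry table, and the handler covers the whole reported error range instead of a single character.

```python
# Python 3 sketch of the "beefed-up unicodedata" idea.
import codecs
import unicodedata

# Only characters that NFKD cannot reduce to ASCII need an entry here.
SPECIALS = {"\u2019": "'"}

def specials(error):
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Unlisted characters (e.g. combining accents) are dropped, like 'ignore'.
    return "".join(SPECIALS.get(ch, "") for ch in bad), error.end

codecs.register_error("specials", specials)

def bu(text):
    """Decompose accented letters, then encode with the custom handler."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "specials")

print(bu("Wikipédia, le projet d\u2019encyclopédie"))
```

The timings quoted above were taken on Python 2.5, so they should not be assumed to carry over; re-measuring with timeit is advisable.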

Alex Martelli 2009-09-05 16:34:29

Thanks for the summary and timing them :)

Anurag Uniyal 2009-09-06 03:47:10

Is there a reason for using 'ord' when creating the dict and then taking the ordinal again while asciifying?

Anurag Uniyal 2009-09-06 05:42:40

@Anurag, the reason the dict's that way is to make it immediately usable in `.translate` -- `asciify` of course doesn't need that. The simplification reduces its timing, roughly, from 7.5 to 7.3 usec.

Alex Martelli 2009-09-06 16:58:01
