Is there any lib that can replace national characters with their ASCII equivalents? For example:

"Cześć"

to:

"Czesc"

I can of course create a map:

{'ś':'s', 'ć': 'c'}

and use some replace function. But I don't want to hardcode all the equivalents into my program if there is already some function that does that.
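That is, something along these lines (just a rough sketch):

PL_MAP = {u'ś': u's', u'ć': u'c'}  # ...and so on for every letter

def asciify(text):
    return u''.join(PL_MAP.get(c, c) for c in text)

print asciify(u'Cześć')  # Czesc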

+3  A: 
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

text = u'Cześć'
# NFD splits each accented letter into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
print unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
nosklo
'NFKD' would give you ASCII output more often than 'NFD' would.
dan04
+2  A: 

You can get most of the way by doing:

import unicodedata

def strip_accents(text):
    # Decompose, then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')
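For example:

print strip_accents(u'Cześć')  # prints: Czesc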

Unfortunately, there are Latin letters that cannot be decomposed into an ASCII letter plus combining marks; you'll have to handle those manually (see the sketch after the list). They include:

  • Æ → AE
  • Ð → D
  • Ø → O
  • Þ → TH
  • ß → ss
  • æ → ae
  • ð → d
  • ø → o
  • þ → th
  • Œ → OE
  • œ → oe
  • ƒ → f
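Here is a minimal sketch of how the manual cases can be folded in, using a translate table built from the list above (the names asciify and LATIN_FALLBACK are illustrative):

import unicodedata

# Manual replacements for letters with no Unicode decomposition.
LATIN_FALLBACK = {
    u'Æ': u'AE', u'Ð': u'D', u'Ø': u'O', u'Þ': u'TH', u'ß': u'ss',
    u'æ': u'ae', u'ð': u'd', u'ø': u'o', u'þ': u'th',
    u'Œ': u'OE', u'œ': u'oe', u'ƒ': u'f',
}
_TABLE = dict((ord(k), v) for k, v in LATIN_FALLBACK.items())

def asciify(text):
    # Substitute the undecomposable letters first, then strip marks.
    text = text.translate(_TABLE)
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')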
dan04
+1  A: 

The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach which includes a map for letters with no decomposition. Note the additional map entries in the comments.
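As a sketch of what such an approach can look like (the names make_table and asciify and the exact map entries are illustrative, not the author's original code), the table can be built once from unicodedata.decomposition, with manual entries layered on top:

import unicodedata

# Manual entries for characters whose Unicode decomposition is empty.
# Extend as needed, e.g.:
#   0x0141: u'L',   # Ł  LATIN CAPITAL LETTER L WITH STROKE
#   0x0142: u'l',   # ł  LATIN SMALL LETTER L WITH STROKE
MANUAL = {
    0x00df: u'ss',  # ß
    0x00f8: u'o',   # ø
}

def make_table():
    table = dict(MANUAL)
    for codepoint in xrange(0x80, 0x10000):
        decomp = unicodedata.decomposition(unichr(codepoint)).split()
        if decomp and decomp[0].startswith('<'):
            decomp = decomp[1:]  # drop a compatibility tag like '<compat>'
        if decomp:
            base = int(decomp[0], 16)
            # Keep only mappings whose base character is plain ASCII.
            if base < 0x80 and codepoint not in table:
                table[codepoint] = unichr(base)
    return table

TABLE = make_table()

def asciify(text):
    # Translate what we can, then drop whatever is still non-ASCII.
    return text.translate(TABLE).encode('ascii', 'ignore')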

John Machin