I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long (decomposed) normalized form, where letters and diacritics are separate characters
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about in Python 3.0?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Thanks for your help.

+2  A: 

Maybe this?

Stephen Pape
+7  A: 

I just found this answer on the Web:

import unicodedata

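def remove_accents(text):
    # Python 2: unicode() converts the input to a Unicode string
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    # Encoding to ASCII with 'ignore' drops every non-ASCII character
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii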

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
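To make the Greek problem concrete, here is a small sketch of the same approach. It assumes Python 3, where `str` is already Unicode and no `unicode()` call is needed; the name `remove_accents_ascii` is only for illustration.

```python
import unicodedata

def remove_accents_ascii(text):
    # Decompose the string, then drop everything that is not ASCII.
    nfkd_form = unicodedata.normalize("NFKD", text)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")

french = remove_accents_ascii("sal\u00e9")    # "salé": only the accent is lost
greek = remove_accents_ascii("\u0394\u038e")  # Greek letters are not ASCII at all

print(repr(french))
print(repr(greek))
```

The French word keeps its base letters, but the Greek base letters are thrown away together with the accent, so the second result is an empty string.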

Edit: this does the trick:

import unicodedata

def remove_accents(text):
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    # Keep only the characters that are not combining marks
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

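unicodedata.combining(c) returns the canonical combining class of the character c as an integer. It is 0 for ordinary characters and nonzero (so it tests as true) for characters that combine with the preceding character, which mainly means diacritics.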
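As a rough illustration of what `unicodedata.combining` reports, here is a sketch that assumes Python 3 (so the `unicode()` call and the `u""` prefix are dropped). It is a port of the function above for illustration, not a different method.

```python
import unicodedata

def remove_accents(text):
    # Decompose, then keep only characters with combining class 0.
    nfkd_form = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# "é" (U+00E9) decomposes into "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
classes = [unicodedata.combining(c) for c in decomposed]

print(classes)
print(remove_accents("sal\u00e9"))
print(remove_accents("\u0394\u038e"))
```

The base letter has class 0 and the accent has a nonzero class, so only the accent is filtered out. Unlike the ASCII-only version, the Greek base letters survive.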

MiniQuark
+14  A: 

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

Update:

The character category "Mn" stands for "Mark, Nonspacing", which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
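The two tests are similar but not identical. The sketch below (Python 3 assumed, variable names chosen for illustration) prints the category and combining class of a few characters. A character can be in category "Mn" and still have combining class 0, in which case the category test removes it and the `unicodedata.combining` test keeps it.

```python
import unicodedata

acute = "\u0301"     # COMBINING ACUTE ACCENT
selector = "\ufe0f"  # VARIATION SELECTOR-16

# Each tuple is (general category, canonical combining class).
acute_info = (unicodedata.category(acute), unicodedata.combining(acute))
selector_info = (unicodedata.category(selector), unicodedata.combining(selector))
letter_info = (unicodedata.category("e"), unicodedata.combining("e"))

print(acute_info)     # category "Mn", nonzero combining class
print(selector_info)  # category "Mn", combining class 0
print(letter_info)    # a plain letter: not a mark, combining class 0
```

For ordinary accented Latin or Greek text both filters give the same result; the difference only shows up with marks like the variation selector above.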

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

oefe
Cool, this seems to work, thanks. Could you explain (in your answer) what the Mn category is, please?
MiniQuark
Stop butchering our languages! In German, the "non-accented counterpart" to ä is ae, _not_ a.
hop
@hop: I'm French, so I know exactly what you mean. For example, "salé" means "salty" and "sale" means "dirty". Accents are useful. I wish I could keep them, but unfortunately a lot of software still refuses them, so sometimes you just cannot avoid removing them. In fact, the goal of this question is precisely to find the best way to remove accents while "butchering" as little as possible. So if you have a better solution than the one suggested here, I would be more than happy to use it.
MiniQuark
+1 for Accents, Umlauts etc. are not "decoration".
kaizer.se
+10  A: 

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Christian Oudard
Yeah, this is a better solution than simply stripping the accents. It provides much more useful transliterations for the languages that have conventions for writing words in ASCII.
Paul McMillan
A: 

This is quite a good solution. Written in JavaScript but easily ported: http://semplicewebsites.com/removing-accents-javascript

Ed