ansaurus

Question

What is the best way to remove accents in a python unicode string?

Answer 1

+2 A:

Maybe this?

Stephen Pape 2009-02-05 21:15:15

Answer 2

+7 A:

I just found this answer on the Web:

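import unicodedata

def remove_accents(text):
    # Python 2: decompose each character into a base letter plus combining marks
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    # then drop everything that cannot be encoded as ASCII
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii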

It works fine for French, for example, but I think the second step (removing the accents) could be handled better than by dropping all non-ASCII characters: for some languages (Greek, for example) this deletes the letters themselves, not just the accents. The best solution would probably be to explicitly remove the Unicode characters that are tagged as being diacritics.

Edit: this does the trick:

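import unicodedata

def remove_accents(text):
    # Python 2: decompose each character into a base letter plus combining marks
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    # keep only the characters that are not combining marks
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])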

unicodedata.combining(c) returns the canonical combining class of the character c as an integer. It is non-zero (and therefore true) if c combines with the preceding character, which mainly means that it is a diacritic.
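For readers on Python 3, here is a minimal sketch that puts the two approaches side by side. It assumes the input is already a str, so the unicode() call is dropped; the function names and the sample string are only illustrative.

```python
import unicodedata

def remove_accents_ascii(text):
    """Decompose, then drop every character that is not ASCII."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")

def remove_accents_combining(text):
    """Decompose, then drop only the combining characters."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# "salé" followed by GREEK CAPITAL LETTER UPSILON WITH TONOS
sample = "sal\u00e9 \u038e"

print(ascii(remove_accents_ascii(sample)))      # the Greek letter is lost entirely
print(ascii(remove_accents_combining(sample)))  # the Greek base letter survives
print(unicodedata.combining("\u0301"))          # combining class of COMBINING ACUTE ACCENT
```

The first function loses the Greek letter along with its accent, while the second keeps the unaccented base letter.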

MiniQuark 2009-02-05 21:19:34

Answer 3

+14 A:

How about this:

import unicodedata
def strip_accents(s):
   return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

Update:

The character category "Mn" stands for "Mark, Nonspacing", which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
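To see what the two tests look at, this small Python 3 sketch lists the name, general category and combining class of each code point in a decomposed string. The describe helper is a hypothetical name used only for illustration.

```python
import unicodedata

def describe(text):
    """Return (name, category, combining class) for each code point of NFD(text)."""
    decomposed = unicodedata.normalize("NFD", text)
    return [
        (unicodedata.name(c), unicodedata.category(c), unicodedata.combining(c))
        for c in decomposed
    ]

# "é" decomposes into a base letter followed by a combining accent
for row in describe("\u00e9"):
    print(row)
```

For this input, the base letter has category "Ll" and combining class 0, and the accent has category "Mn" and a non-zero combining class, so both filters remove the same character.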

oefe 2009-02-05 22:17:22

Cool, this seems to work, thanks. Could you explain (in your answer) what the Mn category is, please?

MiniQuark 2009-02-06 09:01:43

Stop butchering our languages! In German, the "non-accented counterpart" to ä is ae, _not_ a.

hop 2009-02-06 16:43:28

@hop: I'm French: I know exactly what you mean. For example "salé" means "salty", and "sale" means "dirty". Accents are useful. I just wish I could keep them, but unfortunately they are still refused in a lot of software, so sometimes you just cannot avoid removing them. In fact the goal of this question is precisely to try to find the best way to remove accents, while "butchering" as little as possible. So if you have a better solution than the one suggested here, I would be more than happy to use it.

MiniQuark 2009-05-27 14:07:00

+1 for Accents, Umlauts etc. are not "decoration".

kaizer.se 2009-09-11 11:55:00

Answer 4

+10 A:

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Christian Oudard 2010-04-13 21:21:14

Yeah, this is a better solution than simply stripping the accents. It provides much more useful transliterations for the languages that have conventions for writing words in ASCII.
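To illustrate what such a convention looks like, here is a minimal Python 3 sketch using only the standard library. It is not Unidecode's API or implementation: the GERMAN_ASCII table and the to_ascii_german function are hypothetical names, and the table covers only the German umlauts and ß.

```python
import unicodedata

# Hypothetical table: the usual German convention for writing umlauts in ASCII
GERMAN_ASCII = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # ä ö ü
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # Ä Ö Ü
    "\u00df": "ss",                                  # ß
})

def to_ascii_german(text):
    """Apply the German table first, then strip any remaining accents."""
    # Compose first so that "a" + combining diaeresis also matches the table
    composed = unicodedata.normalize("NFC", text)
    replaced = composed.translate(GERMAN_ASCII)
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(to_ascii_german("M\u00e4dchen"))   # German umlaut, handled by the table
print(to_ascii_german("caf\u00e9"))      # other accents fall back to stripping
```

Applying the language-specific table before the generic stripping step is the design choice here: anything the table does not cover still falls back to the base letter.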

Paul McMillan 2010-04-13 21:29:24

Answer 5

A:

This is quite a good solution. Written in JavaScript but easily ported: http://semplicewebsites.com/removing-accents-javascript

Ed 2010-06-29 15:50:21
