I have an application implementing incremental search. I have a catalog of unicode strings to be matched against a given "key" string; a catalog string is a "hit" if it contains all of the characters in the key, in order, and it ranks better if the key characters cluster closely in the catalog string.
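
(To illustrate: the key must occur as a subsequence of the catalog string. A rough sketch, not my actual implementation, of just the containment test; the clustering rank is omitted:)

def is_hit(key, catalog_string):
    # True if all characters of the key occur in the catalog string, in order
    it = iter(catalog_string.lower())
    return all(ch in it for ch in key.lower())

>>> is_hit(u"öst", u"Östblocket")
True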

This works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "r**öst**" or "r**ö**d sten".

Anyway, now I want to implement folding, since there are some cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".

For example: "Ole" should match "Olé"

How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.

The folding does not have to turn the string into ascii; in fact, the algorithm's output string could well be unicode. Leaving a character in is better than stripping it.


I don't know which answer to accept, since I use a bit of both. Taking the NFKD decomposition and removing combining marks gets almost all the way there; I only add some custom transliterations on top of that. Here is the module as it looks now: (Warning: contains unicode chars inline, since it is much nicer to edit that way.)

# -*- encoding: UTF-8 -*-

import unicodedata
from unicodedata import normalize, category

def _folditems():
    _folding_table = {
        # general non-decomposing characters
        # FIXME: This is not complete
        u"ł" : u"l",
        u"œ" : u"oe",
        u"ð" : u"d",
        u"þ" : u"th",
        u"ß" : u"ss",
        # germano-scandinavic canonical transliterations
        u"ü" : u"ue",
        u"å" : u"aa",
        u"ä" : u"ae",
        u"æ" : u"ae",
        u"ö" : u"oe",
        u"ø" : u"oe",
    }

    for c, rep in _folding_table.iteritems():
        yield (ord(c.upper()), rep.title())
        yield (ord(c), rep)

folding_table = dict(_folditems())

def tofolded(ustr):
    u"""Fold @ustr

    Return a unicode str where composed characters are replaced by
    their base, and extended latin characters are replaced by
    similar basic latin characters.

    >>> tofolded(u"Wyłącz")
    u'Wylacz'
    >>> tofolded(u"naïveté")
    u'naivete'

    Characters from other scripts are not transliterated.

    >>> tofolded(u"Ἑλλάς") == u'Ελλας'
    True

    (These doctests pass, but should they fail, they fail hard)
    """
    srcstr = normalize("NFKD", ustr.translate(folding_table))
    return u"".join(c for c in srcstr if category(c) != 'Mn')

if __name__ == '__main__':
    import doctest
    doctest.testmod()

(And, for the actual matching, if that interests anyone: I construct folded strings for my whole catalog beforehand, and put the folded versions into the catalog object's already-available alias property.)
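
(Roughly like this, combining tofolded above with the is_hit sketch from the top of the question; the CatalogItem class is just an illustrative stand-in for my real catalog objects:)

class CatalogItem(object):
    def __init__(self, name):
        self.name = name
        # fold once, when the catalog is built, not on every search
        self.alias = tofolded(name)

def search(key, catalog):
    # fold the short key once, then test it against every precomputed alias
    folded_key = tofolded(key)
    return [item for item in catalog if is_hit(folded_key, item.alias)]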

+1  A: 

Have a look at this: ftp://alan.smcvt.edu/hefferon/unicode2ascii.py

Probably not complete, but might get you started.

stefanw
The script is good inspiration, but I actually don't want to turn it into pure ascii (anymore). Check the updated question for what I use now.
kaizer.se
+2  A: 

You can use this strip_accents function to remove the accents:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', unicode(s))
                   if unicodedata.category(c) != 'Mn')

>>> strip_accents(u'Östblocket')
u'Ostblocket'
JG
This leaves unhandled characters in the string instead of stripping them, which is good: 'Dźwięk' -> 'Dzwiek' but 'Wyłącz' -> 'Wyłacz'; then I can add special cases on top of that.
kaizer.se
I accept this answer since filtering the normalization goes almost all the way. Check the updated question for what I use.
kaizer.se
However, note that you want to filter on the NF**K**D normalization; this is the compatibility decomposition, without strict equivalence, so that, for example, subscript characters are converted to plain numerals.
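
(A quick way to see the difference; U+2082 is the subscript-two character:)

>>> from unicodedata import normalize
>>> normalize('NFD', u'H₂O')
u'H\u2082O'
>>> normalize('NFKD', u'H₂O')
u'H2O'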
kaizer.se
+1  A: 

A general purpose solution (especially for search normalization and generating slugs) is the unidecode module:

http://pypi.python.org/pypi/Unidecode

It's a port of Perl's Text::Unidecode module. It's not complete, but it translates all Latin-derived characters I could find, transliterates Cyrillic, Chinese, etc. to Latin, and even handles full-width characters correctly.

It's probably a good idea to simply strip all characters you don't want to have in the final output, or replace them with a filler (e.g. "äßœ$" will be unidecoded to "assoe$", so you might want to strip the non-alphanumerics). For characters it will transliterate but shouldn't (say, § => SS and € => EU), you need to clean up the input:

from unidecode import unidecode

input_str = u'äßœ$'
input_str = u''.join([ch if ch.isalnum() else u'-' for ch in input_str])
input_str = str(unidecode(input_str)).lower()

This would replace all non-alphanumeric characters with a dummy replacement and then transliterate the string and turn it into lowercase.

Alan
Thank you for this tip. My application is multilingual, and I'm not sure I want to tip it too far into strict ascii aliasing. My current solution has different but cool features, for example the Japanese 'ヘ' will match 'ペ' just like 'a' will match 'ä'; the diacritic stripping is more universal, although I haven't established whether this is useful or not (I only have feedback from western users).
kaizer.se
A: 

What about this one:

from unicodedata import normalize

normalize('NFKD', unicode_string).encode('ASCII', 'ignore').lower()

Taken from here (Spanish) http://python.org.ar/pyar/Recetario/NormalizarCaracteresUnicode

Esteban Feldman
For my application, I already addressed this in a different comment: I want to have a *unicode* result and *leave unhandled characters untouched*.
kaizer.se