I have an application implementing incremental search. I have a catalog of unicode strings to be matched against a given "key" string; a catalog string is a "hit" if it contains all of the characters in the key, in order, and it ranks better if the key characters cluster closely in the catalog string.
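
(To illustrate: the key must occur as a subsequence of the catalog string. A rough sketch, not my actual implementation, of just the containment test; the clustering rank is omitted:)

def is_hit(key, catalog_string):
    # True if all characters of the key occur in the catalog string, in order
    it = iter(catalog_string.lower())
    return all(ch in it for ch in key.lower())

>>> is_hit(u"öst", u"Östblocket")
True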

This works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "r**öst**" or "r**ö**d sten".

Anyway, now I want to implement folding, since there are some cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".

For example: "Ole" should match "Olé"

How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.

The folding does not have to turn the string into ascii; in fact, the algorithm's output string could well be unicode. Leaving a character in is better than stripping it.


I don't know which answer to accept, since I use a bit of both. Taking the NFKD decomposition and removing combining marks gets almost all the way there; I only add some custom transliterations on top of that. Here is the module as it looks now: (Warning: contains unicode chars inline, since it is much nicer to edit that way.)

# -*- encoding: UTF-8 -*-

import unicodedata
from unicodedata import normalize, category

def _folditems():
    _folding_table = {
        # general non-decomposing characters
        # FIXME: This is not complete
        u"ł" : u"l",
        u"œ" : u"oe",
        u"ð" : u"d",
        u"þ" : u"th",
        u"ß" : u"ss",
        # germano-scandinavic canonical transliterations
        u"ü" : u"ue",
        u"å" : u"aa",
        u"ä" : u"ae",
        u"æ" : u"ae",
        u"ö" : u"oe",
        u"ø" : u"oe",
    }

    for c, rep in _folding_table.iteritems():
        yield (ord(c.upper()), rep.title())
        yield (ord(c), rep)

folding_table = dict(_folditems())

def tofolded(ustr):
    u"""Fold @ustr

    Return a unicode str where composed characters are replaced by
    their base, and extended latin characters are replaced by
    similar basic latin characters.

    >>> tofolded(u"Wyłącz")
    u'Wylacz'
    >>> tofolded(u"naïveté")
    u'naivete'

    Characters from other scripts are not transliterated.

    >>> tofolded(u"Ἑλλάς") == u'Ελλας'
    True

    (These doctests pass, but should they fail, they fail hard)
    """
    srcstr = normalize("NFKD", ustr.translate(folding_table))
    return u"".join(c for c in srcstr if category(c) != 'Mn')

if __name__ == '__main__':
    import doctest
    doctest.testmod()

(And, for the actual matching, if that interests anyone: I construct folded strings for my whole catalog beforehand, and put the folded versions into the catalog object's already-available alias property.)
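
(Roughly like this, combining tofolded above with the is_hit sketch from the top of the question; the CatalogItem class is just an illustrative stand-in for my real catalog objects:)

class CatalogItem(object):
    def __init__(self, name):
        self.name = name
        # fold once, when the catalog is built, not on every search
        self.alias = tofolded(name)

def search(key, catalog):
    # fold the short key once, then test it against every precomputed alias
    folded_key = tofolded(key)
    return [item for item in catalog if is_hit(folded_key, item.alias)]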

+1  A: 

Have a look at this: ftp://alan.smcvt.edu/hefferon/unicode2ascii.py

Probably not complete, but might get you started.

stefanw
The script is good inspiration, but I actually don't want to turn it into pure ascii (anymore). Check the updated question for what I use now.
kaizer.se
+2  A: 

You can use this strip_accents function to remove the accents:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', unicode(s))
                   if unicodedata.category(c) != 'Mn')

>>> strip_accents(u'Östblocket')
u'Ostblocket'
JG
This leaves unhandled characters in the string instead of stripping them, which is good: 'Dźwięk' -> 'Dzwiek' but 'Wyłącz' -> 'Wyłacz'; then I can add special cases on top of that.
kaizer.se
I accept this answer since filtering the normalization goes almost all the way. Check the updated question for what I use.
kaizer.se
However, note that you want to filter on the NF**K**D normalization; this is the compatibility decomposition, without strict equivalence, so that, for example, subscript characters are converted to plain numerals.
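
(A quick way to see the difference; U+2082 is the subscript-two character:)

>>> from unicodedata import normalize
>>> normalize('NFD', u'H₂O')
u'H\u2082O'
>>> normalize('NFKD', u'H₂O')
u'H2O'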
kaizer.se
+1  A: 

A general purpose solution (especially for search normalization and generating slugs) is the unidecode module:

http://pypi.python.org/pypi/Unidecode

It's a port of Perl's Text::Unidecode module. It's not complete, but it translates all Latin-derived characters I could find, transliterates Cyrillic, Chinese, etc. to Latin, and even handles full-width characters correctly.

It's probably a good idea to simply strip all characters you don't want to have in the final output, or replace them with a filler (e.g. "äßœ$" will be unidecoded to "assoe$", so you might want to strip the non-alphanumerics). For characters it will transliterate but shouldn't (say, § => SS and € => EU), you need to clean up the input:

from unidecode import unidecode

input_str = u'äßœ$'
input_str = u''.join([ch if ch.isalnum() else u'-' for ch in input_str])
input_str = str(unidecode(input_str)).lower()

This would replace all non-alphanumeric characters with a dummy replacement and then transliterate the string and turn it into lowercase.

Alan
Thank you for this tip. My application is multilingual, and I'm not sure I want to tip it too far into strict ascii aliasing. My current solution has different but cool features, for example the Japanese 'ヘ' will match 'ペ' just like 'a' will match 'ä'; the diacritic stripping is more universal, although I haven't established whether this is useful or not (I only have feedback from western users).
kaizer.se
A: 

What about this one:

from unicodedata import normalize

normalize('NFKD', unicode_string).encode('ASCII', 'ignore').lower()

Taken from here (Spanish) http://python.org.ar/pyar/Recetario/NormalizarCaracteresUnicode

Esteban Feldman
For my application, I already addressed this in a different comment: I want to have a *unicode* result and *leave unhandled characters untouched*.
kaizer.se