I have an application implementing incremental search. I have a catalog of unicode strings to be matched and match them to a given "key" string; a catalog string is a "hit" if it contains all of the characters in the key, in order, and it is ranked better if the key characters cluster in the catalog string.
Anyway, this works fine and matches unicode exactly, so that "öst" will match "Östblocket" or "r**öst**" or "r**ö**d sten".
Anyway, now I want to implement folding, since there are some cases where it is not useful to distinguish between a catalog character such as "á" or "é" and the key character "a" or "e".
For example: "Ole" should match "Olé"
How do I best implement this unicode-folding matcher in Python? Efficiency is important since I have to match thousands of catalog strings to the short, given key.
It does not have to turn it into ascii; in fact, the algorithm's output string could be unicode. Leaving a character in is better than stripping it.
I don't know which answer to accept, since I use a bit of both. Taking the NKFD decomposition and removing combining marks goes almost the way to finish, I only add some custom transliterations to that. Here is the module as it looks now: (Warning, contains unicode chars inline, since it is much nicer to edit that way.)
# -*- encoding: UTF-8 -*-
import unicodedata
from unicodedata import normalize, category
def _folditems():
_folding_table = {
# general non-decomposing characters
# FIXME: This is not complete
u"ł" : u"l",
u"œ" : u"oe",
u"ð" : u"d",
u"þ" : u"th",
u"ß" : u"ss",
# germano-scandinavic canonical transliterations
u"ü" : u"ue",
u"å" : u"aa",
u"ä" : u"ae",
u"æ" : u"ae",
u"ö" : u"oe",
u"ø" : u"oe",
}
for c, rep in _folding_table.iteritems():
yield (ord(c.upper()), rep.title())
yield (ord(c), rep)
folding_table = dict(_folditems())
def tofolded(ustr):
u"""Fold @ustr
Return a unicode str where composed characters are replaced by
their base, and extended latin characters are replaced by
similar basic latin characters.
>>> tofolded(u"Wyłącz")
u'Wylacz'
>>> tofolded(u"naïveté")
u'naivete'
Characters from other scripts are not transliterated.
>>> tofolded(u"Ἑλλάς") == u'Ελλας'
True
(These doctests pass, but should they fail, they fail hard)
"""
srcstr = normalize("NFKD", ustr.translate(folding_table))
return u"".join(c for c in srcstr if category(c) != 'Mn')
if __name__ == '__main__':
import doctest
doctest.testmod()
(And, for the actual matching if that interests anyone: I construct folded strings for all my catalog beforehand, and put the folded versions into the already-available catalog object alias property.)