I am working with Unicode text sources (all correctly encoded) and I want to match names. It is the classic problem: one source has the correct spelling, while another has flattened names:

"Elbląg" vs. "Elblag" (see the character a)

How can I "flatten" ą, á, â or à to a for better matching? Are there Unicode-to-ASCII mapping tables?

+1  A: 

Try

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'Elbląg').encode('ascii', 'ignore')
'Elblag'
KennyTM
Thanks, all the Polish names are matching!
Christian Harms
Be aware that not all accented Latin letters have an NFKD decomposition into an ASCII letter plus a combining mark. For example, 'ø' does not decompose.
dan04
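To illustrate the caveat about letters such as 'ø', here is a minimal Python 3 sketch that maps the non-decomposing letters by hand before applying NFKD. The `FALLBACK` table and the `flatten` function are hypothetical names, and the table entries are only examples, not a complete list; extend it for whatever languages your data contains.

```python
import unicodedata

# Hypothetical fallback table for letters that have no NFKD decomposition.
# The entries are illustrative only; add more as your data requires.
FALLBACK = str.maketrans({
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
})


def flatten(text):
    """Return an ASCII approximation of text."""
    # Replace the letters that NFKD cannot decompose.
    text = text.translate(FALLBACK)
    # Split base letters from their combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII (the combining marks), then
    # decode so the result is a str rather than bytes.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Without the table, 'ø' would simply be dropped by the `'ignore'` error handler, so "Søren" would come out as "Sren". Any character that is neither in the table nor decomposable is still silently removed, so it may be worth checking the output for names that got shorter than expected.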