Handling different non-accented versions of Umlaut characters

The German Umlaut characters “ö”, “ä” and “ü” are often replaced with non-accented versions when users type, usually for convenience because they do not have a German keyboard.

With most accented characters there is a particular non-accented version that most people use. The accented “è”, for instance, is always replaced with a standard “e”.

With the Umlaut characters there appears to be a difference between the convention adopted by our British and our American users.

British users will replace them with “o”, “a” and “u” respectively, whereas American users will replace them with “oe”, “ae” and “ue” respectively.

Our search is built on Lucene.Net and, as with any search framework, the usual technique for matching all combinations of accented characters is to replace them with non-accented equivalents, both when the index is created and when the search criteria are supplied. The matching is then done purely on non-accented characters.
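To make the normalise-on-both-sides idea concrete, here is a minimal sketch in Python rather than Lucene.Net. The names `fold_accents`, `add_to_index` and `search` are made up for illustration, and the dictionary only stands in for a real index. It also shows where the single-replacement approach falls short for Umlauts.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Strip combining marks and lower-case, e.g. 'Götz' becomes 'gotz'."""
    # NFD splits 'ö' into 'o' plus a combining diaeresis, which is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# Toy stand-in for a search index: folded term -> set of document ids.
index: dict[str, set[int]] = {}

def add_to_index(doc_id: int, name: str) -> None:
    # The same folding is applied at index time...
    index.setdefault(fold_accents(name), set()).add(doc_id)

def search(query: str) -> set[int]:
    # ...and at query time, so both sides compare non-accented text.
    return index.get(fold_accents(query), set())

add_to_index(1, "Götz")
```

With this in place, `search("Götz")` and `search("Gotz")` both find the document, but `search("Goetz")` does not, which is exactly the gap the question is about.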

How would I parse the accented characters in order to support the following...

A German customer types – “Götz”
A British customer types – “Gotz”
An American customer types – “Goetz”

Given that the name is stored in our database in its correct form, “Götz”, how should I process “Götz” so that all three users can find it in the index?

EDIT

I found an article on CodeProject that was exactly what I was looking for. The example shows how synonyms for words can be added to the Lucene index so that they are matched as well as the original word. With a small adaptation I was able to do exactly what I wanted.
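For anyone landing here, this is roughly what the synonym idea amounts to. It is a sketch in Python, not the code from the CodeProject article and not the Lucene.Net API. The helper names (`index_variants`, `normalize_query`) and the digraph table are my own assumptions. At index time each token is stored under both its accent-stripped form and its “oe”/“ae”/“ue” form, while queries are only accent-stripped.

```python
import unicodedata

# Assumed mapping for the "American" convention described in the question.
UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue",
                   "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_variants(token: str) -> set[str]:
    """Return every form a token should be indexed under."""
    token = unicodedata.normalize("NFC", token)  # make 'ö' a single code point
    expanded = "".join(UMLAUT_DIGRAPHS.get(ch, ch) for ch in token)
    return {strip_accents(token).lower(), strip_accents(expanded).lower()}

def normalize_query(query: str) -> str:
    # Queries are only folded; the variants already live in the index.
    return strip_accents(query).lower()

# Toy stand-in for the index: term -> set of document ids.
index: dict[str, set[int]] = {}
for variant in index_variants("Götz"):
    index.setdefault(variant, set()).add(1)

def search(query: str) -> set[int]:
    return index.get(normalize_query(query), set())
```

Here “Götz”, “Gotz” and “Goetz” all resolve to the same document. In Lucene terms the extra variant would be emitted by a token filter as a synonym of the original token, which is the part I adapted from the article.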
