Removing diacritic symbols from UTF8 string in C

tags:

c
utf-8

+1 Q:

Removing diacritic symbols from UTF8 string in C

Hi all,

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc.) to remove these diacritics from a string, or to perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".

My first attempt was to manually strip out the combining diacritical marks described here:

http://en.wikipedia.org/wiki/Combining_character

This worked in some cases, but it turns out many of these characters also have precomposed Unicode code points of their own. For example, the character "ö" above can be represented by an "o" followed by the combining diaeresis U+0308, but it can also be represented by the single Unicode character U+00F6, and my method only filters the former.
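To make the two representations concrete, here is a minimal sketch of a UTF-8 decoder that lists the code points in a string. The function name `utf8_to_codepoints` is made up for illustration, and the sketch assumes well-formed UTF-8 input (it does not detect malformed sequences).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into code points.
 * Returns the number of code points written to out (at most cap).
 * Assumption: s is well-formed UTF-8; malformed input is not detected. */
size_t utf8_to_codepoints(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < cap) {
        uint32_t cp;
        int extra; /* number of continuation bytes expected */

        if (*p < 0x80)      { cp = *p;        extra = 0; } /* ASCII        */
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; } /* 2-byte lead  */
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; } /* 3-byte lead  */
        else                { cp = *p & 0x07; extra = 3; } /* 4-byte lead  */
        p++;

        /* Fold in the low six bits of each continuation byte. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        out[n++] = cp;
    }
    return n;
}
```

Decoding the bytes `"\xC3\xB6"` yields the single code point U+00F6, while `"o\xCC\x88"` yields two code points, U+006F followed by U+0308. Both render as "ö", which is why a filter that only looks for combining marks misses the first form.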

I have also looked into iconv, which can convert from UTF-8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?

Edit: removed typo in question title.

+1 A:

Seems to be a duplicate of this question.

kriss 2010-10-25 15:08:17

-1 this is a comment

pmg 2010-10-25 15:23:54

@pmg: that is also a link to the answers.

kriss 2010-10-25 15:26:44

@pmg: I mean, remembering the duplicate, finding it and pointing to it took me some time, and that time is repaid with negative rep. The only reason is that I clicked on *Add another Answer* instead of *add comment* (what I usually do) for the question. That goes against the idea of giving an incentive for finding duplicates: http://meta.stackoverflow.com/questions/37466/give-an-incentive-for-finding-duplicate-questions

kriss 2010-10-25 15:44:15

+4 A:

Convert to one of the decomposed normalization forms, probably NFD, though you might even want NFKD. That turns every diacritic into a separate combining character, which can then be stripped.

You will want a library for this. I hear good things about ICU.
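As a rough sketch of the stripping step only: the function below removes combining marks from a string that has already been decomposed. It assumes the input is well-formed UTF-8 that a normalization library such as ICU has already converted to NFD (or NFKD); it does not normalize anything itself. The name `strip_combining_marks` is made up for illustration, and the sketch only covers the Combining Diacritical Marks block (U+0300 to U+036F), not the other blocks of combining characters.

```c
#include <assert.h>
#include <stddef.h>

/* Remove combining marks U+0300..U+036F from a UTF-8 string, in place.
 * Assumption: s is well-formed UTF-8 that is ALREADY in NFD/NFKD, so every
 * diacritic is a separate combining character. Precomposed characters such
 * as U+00F6 are left untouched. Returns the new length in bytes. */
size_t strip_combining_marks(char *s)
{
    assert(s != NULL);
    unsigned char *src = (unsigned char *)s;
    unsigned char *dst = (unsigned char *)s;

    while (*src != '\0') {
        unsigned char b0 = src[0];
        unsigned char b1 = src[1]; /* safe: at worst this is the terminator */

        /* In UTF-8, U+0300..U+033F is 0xCC 0x80..0xBF and
         * U+0340..U+036F is 0xCD 0x80..0xAF. */
        int is_mark = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                      (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);

        if (is_mark) {
            src += 2;          /* skip the two-byte combining mark */
        } else {
            *dst++ = *src++;   /* copy every other byte unchanged */
        }
    }
    *dst = '\0';
    return (size_t)(dst - (unsigned char *)s);
}
```

Called on the decomposed bytes `"moto\xCC\x88rhead"`, this leaves `"motorhead"` in the buffer. Called on the precomposed `"mot\xC3\xB6rhead"`, it changes nothing, which is why the normalization step has to come first.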

Zack 2010-10-25 15:15:07

+1 A:

Use ICU: create a collator over "root" with a strength of PRIMARY (L1), which compares base letters only (it cares about the 'o' and ignores the diaeresis on 'ö'). Then you can use ICU's search functions to match. There is also newer "search collator" functionality that provides special collators designed for this case, but primary strength will handle this specific case.

Example: "motor == mötor" in the 'collated' section.

Steven R. Loomis 2010-10-25 17:36:54
