ansaurus

Question

How to change diacritic characters to non-diacritic ones

Answer 1

+4 A:

Copying from my own answer to another question:

Instead of creating your own table, you could convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
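A minimal C# sketch of this idea, under a few assumptions: the class and method names (AsciiFolding, ToAscii) are made up for illustration, and the filter keeps every ASCII character rather than only letters, so that spaces, digits and punctuation survive.

```csharp
using System.Text;

public static class AsciiFolding
{
    // Hypothetical helper: decompose the text, then keep only ASCII characters.
    // Pass NormalizationForm.FormKD to also apply compatibility decompositions.
    public static string ToAscii(string text, NormalizationForm form = NormalizationForm.FormD)
    {
        string decomposed = text.Normalize(form);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Combining marks and any other non-ASCII characters are dropped here.
            if (c <= '\u007F')
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With the default form D, "á" becomes "a". A ligature such as "ﬁ" (U+FB01) has only a compatibility decomposition, so it is dropped under form D but becomes "fi" under form KD, which is the kind of extra case NFKD catches.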

References:

http://unicode.org/reports/tr15/

http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx

http://blogs.msdn.com/michkap/archive/2007/05/14/2629747.aspx

CesarB 2008-12-01 16:21:49

Please don't do this if possible. You are butchering our languages. Try to use transliteration instead.

hop 2008-12-01 16:24:44

Answer 2

+2 A:

It might also be worthwhile to step back and consider why you want to do this. If you are trying to remove character differences you consider insignificant, you should look at the Unicode collation algorithm. This is the standard way to disregard differences such as case or diacritics when comparing strings for searching or sorting.
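As a rough illustration in C#, the framework's culture-sensitive comparison can ignore case and diacritics without modifying the strings at all. This is a sketch, not a full treatment of the Unicode collation algorithm: it relies on whatever collation data the .NET runtime and platform provide, and the names AccentInsensitiveCompare and EqualsIgnoringMarksAndCase are invented for the example.

```csharp
using System.Globalization;

public static class AccentInsensitiveCompare
{
    // Hypothetical helper: true when the two strings differ only by
    // case or by nonspacing marks (diacritics) under the given culture.
    public static bool EqualsIgnoringMarksAndCase(string a, string b, CultureInfo culture)
    {
        CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
        return culture.CompareInfo.Compare(a, b, options) == 0;
    }
}
```

Called with CultureInfo.InvariantCulture, this should treat "résumé" and "resume" as equal on a typical .NET installation, while the stored text keeps its accents for display.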

If you plan to display the modified text, consider your audience. What you can safely filter away is locale-sensitive. In US English, "Igloo" = "igloo" and "resume" = "résumé", but in Turkish the lowercase of I is ı (dotless i), and in French cote means quote, côté means side, and côte means coast. So the collation language determines which differences are significant.

If removing diacritics is the right solution for your application, it is safest to produce your own table to which you explicitly add the characters you want to convert.
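A minimal C# sketch of such an explicit table follows. The entries are illustrative only, and the names ExplicitFolding, Table and Fold are assumptions for the example; a real table would list exactly the characters your application wants to convert.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ExplicitFolding
{
    // Illustrative entries only; extend with the characters you actually need.
    private static readonly Dictionary<char, string> Table = new Dictionary<char, string>
    {
        { '\u00E9', "e" },  // é
        { '\u00F4', "o" },  // ô
        { '\u00DF', "ss" }, // ß
        { '\u00F8', "o" },  // ø
    };

    // Hypothetical helper: replace only the characters listed in the table,
    // and pass everything else through unchanged.
    public static string Fold(string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (Table.TryGetValue(c, out string replacement))
            {
                result.Append(replacement);
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

One advantage of the table is that it can cover characters such as ß and ø, which have no Unicode decomposition and so are untouched by normalization-based approaches. Anything not listed, such as the Turkish ı, is left exactly as it was.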

A general, automated approach could be devised using Unicode decomposition. With this, you can decompose a character with diacritics into "combining" characters (the diacritic marks) and the base character with which they are combined. Filter out anything that is a combining character, and you should be left with the "non-diacritic" ones.

The lack of discrimination in the automated method, however, could have some unexpected effects. I'd recommend a lot of testing on a representative body of text.
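The decomposition approach described above could be sketched in C# roughly as follows. The names MarkStripper and StripCombiningMarks are hypothetical, and treating all three Unicode mark categories as "combining characters" is an assumption of this sketch.

```csharp
using System.Globalization;
using System.Text;

public static class MarkStripper
{
    // Hypothetical helper: decompose to form D, drop every character whose
    // Unicode category is a mark, then recompose what is left to form C.
    public static string StripCombiningMarks(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isMark = category == UnicodeCategory.NonSpacingMark
                || category == UnicodeCategory.SpacingCombiningMark
                || category == UnicodeCategory.EnclosingMark;
            if (!isMark)
            {
                result.Append(c);
            }
        }
        return result.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Unlike an ASCII filter, this keeps non-Latin base characters. It also shows the lack of discrimination mentioned above: vowel signs in Indic scripts are marks too and would be removed, while a letter like ø, which has no decomposition, passes through unchanged.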

erickson 2008-12-01 16:22:02

I think one of the uses of this is to create nice URLs

tomaszs 2008-12-01 21:57:39

Answer 3

A:

For a simple example:

To decompose the diacritics in a string (the combining marks still have to be filtered out afterwards):

string newString = myDiacriticsString.Normalize(NormalizationForm.FormD);

qui 2009-01-07 15:00:29

This does not work: "ě".Normalize(NormalizationForm.FormD) does not return "e"

Feryt 2010-05-19 11:05:12

Yes it does, use String.ToCharArray() to see it.

Hans Passant 2010-06-07 19:34:21
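To make the exchange in the comments above concrete, here is a small C# sketch that lists the UTF-16 code units of the decomposed string. The names DecompositionInspector and CodePoints are made up for the example.

```csharp
using System.Linq;
using System.Text;

public static class DecompositionInspector
{
    // Hypothetical helper: normalize to form D and format each resulting
    // char as "U+XXXX" so the separate combining marks become visible.
    public static string[] CodePoints(string text)
    {
        return text.Normalize(NormalizationForm.FormD)
                   .ToCharArray()
                   .Select(c => "U+" + ((int)c).ToString("X4"))
                   .ToArray();
    }
}
```

For "ě" (U+011B) this yields two entries, U+0065 (the letter e) and U+030C (combining caron). The normalized string therefore does contain "e", but it is not equal to "e" until the combining mark is removed.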

Answer 4

+3 A:

Since no one has bothered to post the code to do this, here it is.

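using System.Text;
using System.Text.RegularExpressions;

string RemoveDiacriticals(string text)
{
   // Decompose accented characters into base character + combining marks.
   text = text.Normalize(NormalizationForm.FormD);
   // Drop everything except tab, newline, and printable ASCII (U+0020 to U+007E).
   return Regex.Replace(text, @"[^\t\n\u0020-\u007E]", "");
}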

Note: a big reason for needing to do this is integrating with a third-party system that only handles ASCII when your data is in Unicode. This is common. Your options are basically to remove the accented characters entirely, or to remove only the accents and keep the base characters, preserving as much of the original input as you can. Obviously, this is not a perfect solution, but it preserves far more than simply removing every character above ASCII 127.

dan 2010-07-22 23:25:27
