You should look at the answer from this question.
It includes the following method (from Michael Kaplan's blog entry "Stripping is an interesting job"):
    // Requires: using System.Globalization; using System.Text;
    static string RemoveDiacritics(string stIn) {
        // Decompose each character into its base character plus combining marks.
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();
        for (int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            // Keep everything except the combining marks.
            if (uc != UnicodeCategory.NonSpacingMark) {
                sb.Append(stFormD[ich]);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
This strips all NonSpacingMark characters from a string. For example, it converts é to e, because once the string is normalized to FormD, é is actually built from an e and a ´ character (a combining acute accent). The ´ is a "NonSpacingMark", meaning that it is attached to the previous character. The method detects these special characters and rebuilds the string without them. (This is how I understand it; it might not be entirely accurate.)
This will not work for all Unicode characters, but input from users of a Latin-based character set (English, Spanish, French, German, etc.) will be "cleaned". I have no experience with Asian character sets.
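To make the decomposition step concrete, here is a minimal sketch that prints the code units of é before and after normalizing to FormD. The `DecompositionDemo` class and its `Describe` helper are made up for this illustration and are not part of the original routine.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Hypothetical helper: lists each UTF-16 code unit with its Unicode category.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c =>
            string.Format("U+{0:X4}({1})", (int)c, CharUnicodeInfo.GetUnicodeCategory(c))));
    }

    static void Main()
    {
        string composed = "\u00E9"; // é as a single precomposed character
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(Describe(composed));   // U+00E9(LowercaseLetter)
        Console.WriteLine(Describe(decomposed)); // U+0065(LowercaseLetter) U+0301(NonSpacingMark)
    }
}
```

The second line of output is why filtering on `UnicodeCategory.NonSpacingMark` leaves a plain e behind. Letters that have no canonical decomposition, such as ø, contain no combining mark and so should pass through unchanged.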
After feedback
I adjusted the routine to the info I got from comments and answers to this question. My current version is:
    public static string RemoveDiacritics(string stIn) {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();
        for (int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            switch (uc) {
                case UnicodeCategory.NonSpacingMark:
                    // Drop combining marks (the diacritics).
                    break;
                case UnicodeCategory.DecimalDigitNumber:
                    // Map decimal digits from any script to 0-9.
                    sb.Append(CharUnicodeInfo.GetDigitValue(stFormD[ich]).ToString());
                    break;
                default:
                    sb.Append(stFormD[ich]);
                    break;
            }
        }
        // FormKC also folds compatibility characters into their plain equivalents.
        return sb
            .ToString()
            .Normalize(NormalizationForm.FormKC);
    }
This routine will remove diacritics (as far as possible) and will convert other "strange" characters into their "normal" form.
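As a rough usage sketch, the routine can be tried on a few inputs. The logic is repeated here so the example compiles on its own; the `RemoveDiacriticsDemo` class name and the sample strings are my own choices for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

static class RemoveDiacriticsDemo
{
    // Same logic as the revised routine, repeated so this sketch is self-contained.
    public static string RemoveDiacritics(string stIn)
    {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();
        foreach (char ch in stFormD)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(ch))
            {
                case UnicodeCategory.NonSpacingMark:
                    break; // drop combining marks
                case UnicodeCategory.DecimalDigitNumber:
                    sb.Append(CharUnicodeInfo.GetDigitValue(ch)); // digit from any script -> 0-9
                    break;
                default:
                    sb.Append(ch);
                    break;
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormKC);
    }

    static void Main()
    {
        Console.WriteLine(RemoveDiacritics("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
        Console.WriteLine(RemoveDiacritics("\uFF21\uFF22\uFF23"));          // fullwidth letters -> ABC
        Console.WriteLine(RemoveDiacritics("\u0661\u0662\u0663"));          // Arabic-Indic digits -> 123
    }
}
```

The fullwidth letters are left alone by the loop and only become plain ASCII in the final FormKC step, while the Arabic-Indic digits are converted inside the loop by `GetDigitValue`.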