I have a WebForm search page that gets occasional hits from international visitors. When they enter text, it appears to be plain ASCII a-z, 0-9 but they are printed in bold and my "is this text" logic can't handle the input. Is there an easy way in ASP.NET to convert Unicode characters that equate to A-Z, 0-9 into plain old text?

Thanks!

James

A: 

You might try something like this:

Encoding.ASCII.GetString(Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(myString)));

Although I'm not quite sure what the problem is with the input. What exactly are you doing with the text? Does it matter if it contains more than just ASCII characters? And I especially don't know what you mean by "they are printed in bold".

Mike Caron
+5  A: 

You are getting so-called "Fullwidth Forms" of the characters. In Unicode, these are encoded at codepoints U+FF01 to U+FF5E. To get the ASCII codepoint (U+0021 to U+007E) from them, you have to get their codepoint and subtract (0xFF01 - 0x0021) from it.

ASCII: http://unicode.org/charts/PDF/U0000.pdf
Fullwidth Forms: http://unicode.org/charts/PDF/UFF00.pdf

I don't speak ASP.NET, but in Java the code would look like this:

String decodeFullwidth(String s) {
  StringBuilder sb = new StringBuilder();
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (0xFF01 <= c && c <= 0xFF5E) {
      sb.append((char) (c - (0xFF01 - 0x0021)));
    } else {
      sb.append(c);
    }
  }
  return sb.toString();
}
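
For readers working in C#, the same offset arithmetic could be sketched roughly as follows. This is an illustrative port of the Java above; the class name `FullwidthDecoder` and the method name `DecodeFullwidth` are made up for this example.

```csharp
using System.Text;

static class FullwidthDecoder
{
    // Maps fullwidth forms (U+FF01..U+FF5E) to their ASCII counterparts
    // (U+0021..U+007E); every other character is copied unchanged.
    public static string DecodeFullwidth(string s)
    {
        const int Offset = 0xFF01 - 0x0021; // distance between the two blocks
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (c >= '\uFF01' && c <= '\uFF5E')
                sb.Append((char)(c - Offset));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, `FullwidthDecoder.DecodeFullwidth("\uFF41\uFF42\uFF43\uFF11")` returns `"abc1"`. The ideographic space U+3000 lies outside the range and is left as-is.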
Roland Illig
+3  A: 

it appears to be plain ASCII a-z, 0-9 but they are printed in bold

This could be the Unicode "mathematical bold" characters. But more likely it's the "fullwidth" characters ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９. (These are common in East Asian character encodings: "fullwidth" refers to being the same width as a Hanzi/Kanji character.)

To convert either set to ASCII, use the Unicode normalization form KC or KD.
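
In .NET, that normalization is available through `String.Normalize`. A minimal sketch, assuming the input only needs compatibility folding; the wrapper name `WidthNormalizer.ToCompatibilityForm` is hypothetical:

```csharp
using System.Text;

static class WidthNormalizer
{
    // Applies Unicode normalization form KC (compatibility decomposition
    // followed by canonical composition) to the input string.
    public static string ToCompatibilityForm(string input)
    {
        return input.Normalize(NormalizationForm.FormKC);
    }
}
```

For example, `WidthNormalizer.ToCompatibilityForm("\uFF41\uFF42\uFF43\uFF11\uFF12\uFF13")` should give `"abc123"`. Note that form KC folds other compatibility characters as well, such as superscript digits and ligatures, which may or may not be what a search box wants.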

dan04
Yup! All the answers were very helpful, but based on the "normalization" and "KC", "KD" hints I was able to determine that I only needed to call String.Normalize(NormalizationForm.FormKC) to handle the incoming wide characters. Thanks!
+2  A: 

You should look at the answer from this question.

It includes the following method (from Michael Kaplan's blog entry "Stripping is an interesting job"):

// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string stIn) {
  string stFormD = stIn.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  for(int ich = 0; ich < stFormD.Length; ich++) {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
    if(uc != UnicodeCategory.NonSpacingMark) {
      sb.Append(stFormD[ich]);
    }
  }

  return(sb.ToString().Normalize(NormalizationForm.FormC));
}

This will strip all the NonSpacingMark characters from a string. This means it will convert é to e, because in FormD é is decomposed into an e followed by a combining ´ character.
The ´ is a "NonSpacingMark", meaning that it is rendered on top of the previous character. The method detects these special characters and rebuilds the string without them. (This is how I understand it; it might not be entirely accurate.)
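
To see the decomposition described above, one could list the code points and categories of the FormD string. This is only an illustrative sketch; the helper name `DecompositionDemo.Describe` is invented for this example.

```csharp
using System.Globalization;
using System.Text;

static class DecompositionDemo
{
    // Lists each UTF-16 code unit of the FormD (decomposed) input
    // together with its Unicode general category.
    public static string Describe(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in decomposed)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            sb.AppendFormat("U+{0:X4} {1}; ", (int)c, category);
        }
        return sb.ToString();
    }
}
```

For the single character é (U+00E9), `DecompositionDemo.Describe("\u00E9")` yields `U+0065 LowercaseLetter; U+0301 NonSpacingMark; `, which is the pair the method then filters.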

This will not work for all Unicode characters, but input from users using a Latin-based character set (English, Spanish, French, German, etc.) will be "cleaned". I have no experience with Asian character sets.


After feedback

I adjusted the routine to the info I got from comments and answers to this question. My current version is:

    public static string RemoveDiacritics(string stIn) {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            switch (uc) {
                case UnicodeCategory.NonSpacingMark:
                    break;
                case UnicodeCategory.DecimalDigitNumber:
                    sb.Append(CharUnicodeInfo.GetDigitValue(stFormD[ich]).ToString());
                    break;
                default:
                    sb.Append(stFormD[ich]);
                    break;
            }
        }

        return (sb
            .ToString()
            .Normalize(NormalizationForm.FormKC));
    }

This routine will remove diacritics (as much as possible) and will convert the other "strange" characters into their "normal" form.

GvS
Aha, now I get it. At first it looked like a lot of extra code relative to a straight String.Normalize(), but your é example is a great one. Since this seems more correct but also more expensive, do you think I could test for the diacritics with something like IsNormalized()?
IsNormalized indicates whether a string is in one of the normalization forms. If you are worried about performance, you could check whether the string in FormD is different from the original form, using CompareOrdinal.
GvS
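
The CompareOrdinal check suggested in the comment above might be sketched like this. The name `DiacriticCheck.ChangesUnderFormD` is hypothetical, and the check is only a heuristic.

```csharp
using System.Text;

static class DiacriticCheck
{
    // Returns true when decomposing the input (FormD) changes it, which
    // suggests it contains precomposed characters such as U+00E9.
    public static bool ChangesUnderFormD(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        return string.CompareOrdinal(input, decomposed) != 0;
    }
}
```

This still performs one normalization, so it mostly saves the character-by-character pass. Input that already arrives decomposed (an e followed by U+0301) does not change under FormD, so this check reports false for it even though a combining mark is present.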
Just a note for future readers: the last call to "Normalize(NormalizationForm.FormC)" won't reduce the string in the question to "latinish", but using NormalizationForm.FormKC will. In fact, if you aren't concerned about the non-spacing marks, the whole shebang could be myCrazyString.Normalize(NormalizationForm.FormKC);
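
A small sketch of the difference between the two forms mentioned in this comment; the class and method names are placeholders chosen for illustration.

```csharp
using System.Text;

static class FormComparison
{
    // Canonical composition only: fullwidth characters are left alone.
    public static string Canonical(string input) =>
        input.Normalize(NormalizationForm.FormC);

    // Compatibility composition: fullwidth characters fold to ASCII,
    // while accented letters such as U+00E9 are kept.
    public static string Compatibility(string input) =>
        input.Normalize(NormalizationForm.FormKC);
}
```

Given the fullwidth input `"\uFF43\uFF41\uFF46\u00E9"`, `Canonical` returns the string unchanged, whereas `Compatibility` returns `"caf\u00E9"`: the letters become plain ASCII and the accent on the é survives.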
Hey, great, I have adjusted my "string cleanup routine" to the feedback from this question. I will post the new code I use.
GvS