Hi,

We accept all sorts of national characters as UTF-8 strings on input, and we need to convert them to ASCII strings on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)

We have a small utility to get rid of all the diacritics:

// requires: import java.text.Normalizer;
public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;

    // NFD (not NFC) splits each accented letter into base letter + combining marks
    final String sDecomposed = Normalizer.normalize(sText, Normalizer.Form.NFD);
    final int iSize = sDecomposed.length();
    final StringBuilder sb = new StringBuilder(iSize);

    for (int i = 0; i < iSize; i++) {
        final char c = sDecomposed.charAt(i);
        // drop the combining marks, keep every other character unchanged
        if (Character.getType(c) != Character.NON_SPACING_MARK)
            sb.append(c);
    }
    return sb.toString();
}

The question is how to replace characters such as the German sharp s (ß) or Đ and đ, which get through the above normalization method, with their substitutes (for ß the substitute would probably be "ss", and for Đ it would be either "D" or "Dj").

Is there a simple way to do it, without a million .replaceAll() calls?

So for example: Đonardan = Djonardan, Blaß = Blass and so on.

We can replace all "problematic" chars with empty space, but would like to avoid this to make the output as similar to the input as possible.

Thank you for your answers,

Bozo

+1  A: 

Is there a simple way to do it, without a million .replaceAll() calls?

If you just support European, Latin-based languages, around 100 replacements should be enough; that's definitely doable: grab the Unicode charts for Latin-1 Supplement and Latin Extended-A and get the String.replace party started. :-)
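
A minimal sketch of what such a table could look like, using a single lookup map instead of a chain of String.replace calls. The class name AsciiFolder, the method toAscii, and the particular substitutes (for example "Dj" rather than "D" for Đ) are illustrative assumptions, not an established mapping; the table below covers only a handful of letters and would need to be extended from the Unicode charts. Anything left unmapped falls back to a space, as the question suggests.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AsciiFolder {

    // Hypothetical table for letters that NFD cannot decompose; extend as needed.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00DF', "ss"); // sharp s
        REPLACEMENTS.put('\u0110', "Dj"); // capital D with stroke
        REPLACEMENTS.put('\u0111', "dj"); // small d with stroke
        REPLACEMENTS.put('\u00C6', "AE"); // capital ligature AE
        REPLACEMENTS.put('\u00E6', "ae"); // small ligature ae
        REPLACEMENTS.put('\u00D8', "O");  // capital O with stroke
        REPLACEMENTS.put('\u00F8', "o");  // small o with stroke
        REPLACEMENTS.put('\u0141', "L");  // capital L with stroke
        REPLACEMENTS.put('\u0142', "l");  // small l with stroke
    }

    static String toAscii(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // Step 1: split accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // Step 2: drop the combining marks.
            }
            if (c < 128) {
                sb.append(c); // already ASCII
            } else {
                // Step 3: look up the substitute; unmapped characters become a space.
                String replacement = REPLACEMENTS.get(c);
                sb.append(replacement != null ? replacement : " ");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u0110onardan")); // prints "Djonardan"
        System.out.println(toAscii("Bla\u00DF"));     // prints "Blass"
    }
}
```

A map keeps every substitute in one place, so anyone who prefers "D" over "Dj" only has to change one entry.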

Heinzi
I cannot believe that no one has done this already: made a few maps and said, here is one for people who prefer it this way or that way, extend it if you need modifications.
bozo
+1  A: 

You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.

Thomas Pornin
Except that the ICU4J transliterators I've tried (Latin, Cyrillic and Hangul) are extremely inaccurate. Which exact transliterator do you think would fulfill the original request? I am not able to find anything apparently suitable.
jarnbjo
I've tried ICU4J and it was so complicated that I couldn't even run it.
bozo