Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?

For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.

+3  A: 

You could go the other way round and check whether the character is an ASCII character.

public static boolean isAscii(char ch) {
    return ch < 128;
}

You'd then have to analyse the string char by char, of course.

(The method is from commons-lang CharUtils, which contains loads of useful Character methods.)
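
Walking the string char by char with that check might look like the sketch below. The class name `AsciiReplacer`, the helper `replaceNonAscii`, and the replacement parameter are assumptions for illustration; only `isAscii` comes from the answer itself.

```java
final class AsciiReplacer {

    // Same check as in the answer: ASCII covers the values 0 to 127.
    static boolean isAscii(char ch) {
        return ch < 128;
    }

    // Hypothetical helper: copies ASCII chars and swaps every other char
    // for the given replacement.
    static String replaceNonAscii(String input, char replacement) {
        StringBuilder builder = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            builder.append(isAscii(ch) ? ch : replacement);
        }
        return builder.toString();
    }
}
```

Calling `AsciiReplacer.replaceNonAscii("AÜXÜ", 'Y')` gives "AYXY". Because this works on chars, a character stored as a surrogate pair would be replaced by two replacement chars rather than one.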

msparer
+1  A: 

I'm not sure from your example what you're trying to do. If you're just trying to replace all non-ASCII values with Y, then you could loop through the string looking for code points outside the range 0 to 127 and replace those code points with Y.
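
A minimal sketch of such a loop, assuming the goal really is "every non-ASCII code point becomes Y". It iterates over code points rather than chars, so a character stored as a surrogate pair yields a single Y. The names `CodePointReplacer` and `replaceNonAscii` are hypothetical.

```java
final class CodePointReplacer {

    // Hypothetical helper: replaces every code point above 127 with 'Y'.
    static String replaceNonAscii(String input) {
        StringBuilder builder = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            if (codePoint <= 127) {
                builder.appendCodePoint(codePoint);
            } else {
                builder.append('Y');
            }
            // Advance by one char for BMP code points, two for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return builder.toString();
    }
}
```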

Dominic Rodger
+7  A: 

Your definition of "unicode characters" is a bit vague. Beginners usually use the term to denote all UTF-8 characters which are NOT covered by the standard ISO 8859 charset. Is this true in your case? If so, then you likely need to loop through every character of the String and test whether its code point is covered by the ISO 8859 charset.

You can also just have a Map and do the replace in a loop if the map contains the key. For example:

Map<Character, Character> charReplacementMap = new HashMap<Character, Character>() {{
    put('Ü', 'Y');
    // Put more here.
}};

String originalString = "AÜAÜ";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString();

Or do you mean "all characters with diacritical marks"? If so, then you can use java.text.Normalizer to get rid of all diacritical marks:

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 * Remove any diacritical marks (accents like ç, ñ, é, etc) from
 * the given string (so that it returns plain c, n, e, etc).
 * @param string The string to remove diacritical marks from.
 * @return The string with removed diacritical marks, if any.
 */
public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

One pitfall: Ü would become U, not Y. Not sure if that's what you're after. If you want to replace each character by the way it is pronounced, you'll really need to create a mapping. Sure, it's tedious work, but it takes less time than you needed to follow this topic ;)
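
To make that pitfall concrete, the sketch below exposes the intermediate NFD form next to the stripped result. The class `NormalizerDemo` and the methods `decompose` and `stripMarks` are hypothetical names; the calls themselves are the same `java.text.Normalizer` and regex calls used above.

```java
import java.text.Normalizer;

final class NormalizerDemo {

    // Returns the canonical decomposition (NFD): base letter followed by
    // any combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Same approach as removeDiacriticalMarks: decompose, then drop
    // everything in the Combining Diacritical Marks block.
    static String stripMarks(String s) {
        return decompose(s).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

Here `decompose("Ü")` is the two-char string U followed by U+0308 (combining diaeresis), so `stripMarks("AÜXÜ")` gives "AUXU". Letters without a canonical decomposition, such as ø or ß, pass through unchanged, which is another case where a mapping is still needed.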

BalusC
That's how I've usually done it. But this would require you to add each character to the map.
Geo
I don't see any other efficient option for replacing several specific characters, each with its own replacement.
BalusC
If you don't add each character to the map, how do you define the replacement? Or do you want all non-ascii characters replaced by a single ascii character?
C. Ross
@BalusC - actually, the real definition of what is a Unicode character (codepoint) is very precise. The problem is that the OP does not understand that the ASCII characters are a proper subset of the Unicode codepoints.
Stephen C
Or do you just want to remove diacritical marks? I've edited my post to cover that.
BalusC
@Stephen C: Exactly.
BalusC
That normalizer was really great. Now I don't need to replace anything anymore. Will this work regardless of encoding?
Geo
Yes. What's more, Strings in Java are always UTF-16 encoded.
BalusC
+1 for java.text.Normalizer
Tim Büthe
@Geo NFD form will separate combined characters into a separate character and accent (NFC does the opposite). `"\\p{InCombiningDiacriticalMarks}+"` will match the accents for replacement. But do you need to handle Greek letters, or Japanese text, and so on?
McDowell
@McDowell: sounds like Turkish only. At least, I don't know any other language where Ü is pronounced as Y.
BalusC
+2  A: 

You could loop through your string and for every character call

if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
    // replace with Y
}
jitter
Good one for testing code points, but I don't have the impression that he wants to replace *every* character with Y.
BalusC
Well, he says "unicode characters", by which I understand that he probably means replacing all non-ASCII characters with Y. Whatever.
jitter
+1  A: 

It isn't clear to me exactly what is gained by transforming "AÜXÜ" to "AYXY". Is this because Ü is pronounced like Y in a particular language? What language? And what other rules might apply?


In terms of terminology...

"a"

The above is a Unicode string. It contains a single UTF-16 encoded character.

If you wish to limit the range of characters to the English alphabet, have a look at the Normalization performed in this answer.

McDowell
It was just a replacement example. I'll actually replace the character by `_XX_` :)
Geo
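
For the `_XX_` style of replacement mentioned in the last comment, one possible sketch follows. The thread does not say what `XX` stands for, so this assumes it is the code point in uppercase hexadecimal; the names `EscapeReplacer` and `escapeNonAscii` are hypothetical.

```java
final class EscapeReplacer {

    // Assumption: "XX" is the code point written in uppercase hex,
    // so U+00DC becomes "_DC_". ASCII code points are copied unchanged.
    static String escapeNonAscii(String input) {
        StringBuilder builder = new StringBuilder(input.length());
        input.codePoints().forEach(codePoint -> {
            if (codePoint <= 127) {
                builder.appendCodePoint(codePoint);
            } else {
                builder.append(String.format("_%X_", codePoint));
            }
        });
        return builder.toString();
    }
}
```

With this reading, `EscapeReplacer.escapeNonAscii("AÜXÜ")` gives "A_DC_X_DC_". A literal underscore in the input is left as-is, so the result is not reversible unless underscores are escaped too.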