Is there a better way to get rid of accents and turn those letters into regular ones, other than calling String.replaceAll() and replacing the letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to handle every accented letter, or alphabets such as Russian or Chinese.

+1  A: 

Create a map that maps the accented character to the unaccented one. Then iterate over map. Use the key of the map as the character you want to replace, and the value of the map as the replacement character.

Map<String, String> replacements = new LinkedHashMap<String,String>() {{
    put("č", "c");
    put("ž", "z");
    ...
    put("ý", "y");
}};

then:

for (Map.Entry<String, String> entry : replacements.entrySet()) {
    // Strings are immutable, so the result has to be assigned back.
    inputStr = inputStr.replaceAll(entry.getKey(), entry.getValue());
}

This way you write a single replaceAll call inside the loop instead of one call per letter in your source. It still runs once for every entry in the map.
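If scanning the string once per map entry is a concern, the same mapping idea can be applied in a single pass over the characters. The following is only a sketch: the class name AccentMap, the method replaceAccents, and the five entries (taken from the example in the question) are illustrative assumptions, not a complete table.

```java
import java.util.HashMap;
import java.util.Map;

public class AccentMap {
    // Hypothetical lookup table; it only covers the letters from the question's example.
    private static final Map<Character, Character> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u010D', 'c'); // č
        REPLACEMENTS.put('\u017E', 'z'); // ž
        REPLACEMENTS.put('\u00ED', 'i'); // í
        REPLACEMENTS.put('\u00E1', 'a'); // á
        REPLACEMENTS.put('\u00FD', 'y'); // ý
    }

    // Walks the input once and swaps every character that has a mapping.
    public static String replaceAccents(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            Character mapped = REPLACEMENTS.get(ch);
            out.append(mapped != null ? mapped.charValue() : ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is "orčpžsíáýd", written with Unicode escapes to avoid source-encoding issues.
        System.out.println(replaceAccents("or\u010Dp\u017Es\u00ED\u00E1\u00FDd"));
    }
}
```

Characters that are not in the table pass through unchanged, so the map has to list every letter you care about.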

Based on Erick's response I found this page that talks about the different uses of Normalizer.

Vivin Paliath
+12  A: 

java.text.Normalizer will handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);

This will separate all of the accent marks from the characters. Then you just need to throw out every character that is not plain ASCII, which removes the separated marks:

string = string.replaceAll("[^\\p{ASCII}]", "");
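Put together, the two steps above might look like the following sketch. The class name AccentStripper and the method name stripAccents are made up for illustration; the calls to Normalizer.normalize and String.replaceAll are the ones shown above.

```java
import java.text.Normalizer;

public class AccentStripper {
    // Decomposes accented letters into base letter + combining mark (NFD),
    // then deletes every character outside the ASCII range.
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Input is "orčpžsíáýd", written with Unicode escapes to avoid source-encoding issues.
        System.out.println(stripAccents("or\u010Dp\u017Es\u00ED\u00E1\u00FDd")); // prints "orcpzsiayd"
    }
}
```

One caveat: letters that have no canonical decomposition, such as đ or ł, are not split by NFD, so the ASCII filter deletes them outright instead of turning them into d or l.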
Erick Robertson
+1 Didn't know about normalizer!
Vivin Paliath
I like this approach :-) thx
Martin S.
+1  A: 

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks:

https://secure.wikimedia.org/wikipedia/en/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might inherently change the meaning of the word, or turn the letters into completely different ones.

NinjaCat
Agreed. For example, in Swedish: "höra" (hear) -> "hora" (whore)
Christoffer Hammarström
It doesn't matter what they mean. The question is how to remove them.
Erick Robertson