ansaurus

Question

Java - getting rid of accents and converting them to regular letters

Answer 1

+1 A:

Create a map from each accented character to its unaccented equivalent, then iterate over the map. Use each key as the character you want to replace and its value as the replacement character.

Map<String, String> replacements = new LinkedHashMap<String,String>() {{
    put("č", "c");
    put("ž", "z");
    ...
    put("ý", "y");
}};

then:

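for (Map.Entry<String, String> entry : replacements.entrySet()) {
    // Strings are immutable, so assign the result back; replace() treats the key literally, not as a regex.
    inputStr = inputStr.replace(entry.getKey(), entry.getValue());
}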

This way the code has a single replacement call inside a loop instead of one hand-written replaceAll call per character.
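For completeness, here is a minimal self-contained sketch of the map approach. The class name AccentMapDemo and the method name stripAccents are made up for illustration, and the table only holds the three characters shown above, so you would have to extend it yourself. The characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AccentMapDemo {
    // Hypothetical, deliberately incomplete table: add the characters you need.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u010D", "c"); // c with caron
        REPLACEMENTS.put("\u017E", "z"); // z with caron
        REPLACEMENTS.put("\u00FD", "y"); // y with acute
    }

    // Replaces every character found in the table with its plain counterpart.
    static String stripAccents(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            // Strings are immutable, so the result must be assigned back.
            // replace() matches the key literally; replaceAll() would treat it as a regex.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u017Elut\u00FD \u010Daj"));
    }
}
```

Characters that are not in the table pass through unchanged, which may or may not be what you want.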

Based on Erick's response, I found this page that talks about the different uses of Normalizer.

Vivin Paliath 2010-07-23 20:36:55

Answer 2

+12 A:

java.text.Normalizer will handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);

This will separate all of the accent marks from the characters. Then you just need to throw out every character that is not ASCII, which removes the separated marks.

string = string.replaceAll("[^\\p{ASCII}]", "");
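Put together, the two steps might look like the sketch below. The class and method names are made up for illustration. stripToAscii is the approach from this answer; stripMarks is a variation (not from the answer) that removes only combining marks via \p{M}, so characters with no canonical decomposition, such as đ (U+0111), are kept instead of being deleted.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Approach from the answer: decompose, then drop everything that is not ASCII.
    static String stripToAscii(String input) {
        // NFD splits e.g. U+00E9 into "e" followed by U+0301 (combining acute accent).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    // Variation (an assumption, not part of the answer): drop only combining marks.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches characters in the Unicode "Mark" category.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripToAscii("\u010Daj")); // c with caron + "aj"
        System.out.println(stripMarks("\u0111ak"));   // d with stroke + "ak"
    }
}
```

Be aware that the ASCII filter silently deletes any non-ASCII character that does not decompose, so stripToAscii turns "đak" into "ak" rather than "dak". If you need "d" there, you still need a small replacement table for those characters.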

Erick Robertson 2010-07-23 20:38:02

+1 Didn't know about normalizer!

Vivin Paliath 2010-07-23 20:42:18

I like this approach :-) thx

Martin S. 2010-07-23 20:46:22

Answer 3

+1 A:

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks:

https://secure.wikimedia.org/wikipedia/en/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.

NinjaCat 2010-07-23 20:41:03

Agreed. For example, in Swedish: "höra" (hear) -> "hora" (whore)

Christoffer Hammarström 2010-10-05 07:08:23

It doesn't matter what they mean. The question is how to remove them.

Erick Robertson 2010-10-21 14:41:48
