ansaurus

Question

Method to substitute foreign for English characters in Java?

Answer 1

+1 A:

You're going to have to do a loop:

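import java.util.HashMap;
import java.util.Map;

String text = "Je prends une thé chaud, s'il vous plaît";
// Map each accented character to its replacement string.
Map<Character, String> replace = new HashMap<Character, String>();
replace.put('é', "e");
replace.put('î', "i");
replace.put('è', "e");
// Build the result in a single pass over the input.
StringBuilder s = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
  char c = text.charAt(i);
  String rep = replace.get(c);
  if (rep == null) {
    s.append(c);   // no mapping: keep the character as-is
  } else {
    s.append(rep); // mapped: append the replacement
  }
}
text = s.toString();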

Note: Some characters are replaced with multiple characters. In German, for example, u-umlaut is converted to "ue".

Edit: Made it much more efficient.

cletus 2009-06-19 08:55:03

Except in Java. (Sorry, couldn't resist.)

Samir Talwar 2009-06-19 08:56:20

Ugh. Please don't use the regexp-based method for that.

Michael Borgwardt 2009-06-19 09:03:06

On second glance, I suppose you have to if you want to replace single characters with multiple ones, but I'm not sure the OP wants that. It would have to be implemented on a per-locale basis and probably end in an ad-hoc mess - I don't think all languages have clear-cut established rules for substituting accented characters like German has.

Michael Borgwardt 2009-06-19 09:35:53

This will be very inefficient for long strings (O(n^2)), because for each accented character the whole string is traversed.

starblue 2009-06-19 10:52:25

Answer 2

A:

There's no standard method as far as I know, but here's a class that does what you want:

http://www.javalobby.org/java/forums/t19704.html

finnw 2009-06-19 08:59:45

Answer 3

A:

You'll need a loop.

An efficient solution would be something like the following:

    // Assumes a String variable named text holds the input.
    Map<Character, Character> map = new HashMap<Character, Character>();
    map.put('é', 'e');
    map.put('î', 'i');
    map.put('è', 'e');

    StringBuilder b = new StringBuilder();
    for (char c : text.toCharArray())
    {
        if (map.containsKey(c))
        {
            b.append(map.get(c));
        }
        else
        {
            b.append(c);
        }
    }
    String result = b.toString();

Of course in a real program you would encapsulate both the construction of the map and the replacement in their respective methods.
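
A minimal sketch of that encapsulation, assuming the same three mappings as above. The class name AccentReplacer and the method names are hypothetical, not from the answer. The map is built once in its own method and the replacement does a single lookup per character:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative only.
final class AccentReplacer {

    // Built once and reused. Unicode escapes avoid source-encoding problems.
    private static final Map<Character, Character> REPLACEMENTS = buildMap();

    private static Map<Character, Character> buildMap() {
        Map<Character, Character> map = new HashMap<Character, Character>();
        map.put('\u00e9', 'e'); // e with acute
        map.put('\u00ee', 'i'); // i with circumflex
        map.put('\u00e8', 'e'); // e with grave
        return map;
    }

    // Replaces every mapped character in one pass; unmapped characters are kept.
    static String replaceAccents(String text) {
        StringBuilder b = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character rep = REPLACEMENTS.get(c);
            if (rep != null) {
                b.append(rep.charValue());
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // prints: the chaud, s'il vous plait
        System.out.println(replaceAccents("th\u00e9 chaud, s'il vous pla\u00eet"));
    }
}
```

Any character missing from the map passes through unchanged, so the table has to list every character you care about.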

starblue 2009-06-19 09:10:59

Answer 4

+2 A:

There's no method that works identically to the PHP one in the standard API, though there may be something in Apache Commons. You could do it by replacing the characters individually:

s = s.replace('é','e').replace('î', 'i').replace('è', 'e');

A more sophisticated method that does not require you to enumerate the characters to substitute (and is thus more likely not to miss anything) but does require a loop (which will happen anyway internally, whatever method you use) would be to use java.text.Normalizer to separate letters and diacritics and then strip out everything with a character type of Character.MODIFIER_LETTER.
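
A rough sketch of the Normalizer idea. One deliberate difference from the description above: this version filters on Character.NON_SPACING_MARK (Unicode category Mn) rather than Character.MODIFIER_LETTER, because the combining accents produced by canonical decomposition (for example U+0301, combining acute accent) belong to Mn. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper; the class and method names are illustrative only.
final class DiacriticStripper {

    static String stripDiacritics(String text) {
        // NFD splits a precomposed letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder b = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            // Drop non-spacing marks (category Mn); keep everything else.
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // prints: the chaud, s'il vous plait
        System.out.println(stripDiacritics("th\u00e9 chaud, s'il vous pla\u00eet"));
    }
}
```

No replacement table is needed here, but the result is always the bare base letter, so language-specific rules such as German "ue" for u-umlaut are not applied.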

Michael Borgwardt 2009-06-19 09:13:13

Answer 5

+3 A:

A really nice way to do it is using the replaceEach() method from the StringUtils class in Apache Commons Lang 2.4.

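import org.apache.commons.lang.StringUtils;

String text = "Je prends une thé chaud, s'il vous plaît";
// search[i] is replaced by replace[i]; both arrays must have the same length.
String[] search = new String[] {"é", "î", "è"};
String[] replace = new String[] {"e", "i", "e"};
String newText = StringUtils.replaceEach(text, search, replace);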

Results in

Je prends une the chaud, s'il vous plait

Harry Lime 2009-06-19 09:13:25

It's rarely worth adding a library dependency for a function that's trivial to implement.

cletus 2009-06-19 10:48:33

Trivial to implement, not so trivial to test maybe. By using a library as widely used as commons-lang you can be reasonably confident that it works well.

Harry Lime 2009-06-19 11:09:56

One could just as well say that it's rarely worth re-implementing a utility (adding more of your own code to test and maintain) when a perfectly good implementation already exists in a widely-used library.

Jonik 2009-06-19 11:12:18

There's almost certainly far more than this one function in Apache Commons that would be useful to your project.

Michael Borgwardt 2009-06-19 11:17:48

@Michael: I agree; although you probably meant in Commons *Lang* (as Apache Commons consists of several independently released libraries). @Harry, could you actually correct this in the answer -> "in Apache Commons Lang 2.4"

Jonik 2009-06-19 11:27:16

@Jonik Made that change. Thanks.

Harry Lime 2009-06-19 11:41:56

Answer 6

+2 A:

I'm not a Java guy, but I'd recommend a generic solution using the Normalizer class to decompose accented characters and then remove the Unicode "COMBINING" characters.
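
For what it's worth, a compact sketch of this approach in Java might look like the following. It assumes "COMBINING" characters means Unicode category Mn (non-spacing marks), and the class and method names are made up for illustration:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class CombiningMarkRemover {

    // Matches runs of non-spacing marks (Unicode category Mn).
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{Mn}+");

    static String removeCombiningMarks(String text) {
        // Decompose accented letters, then delete the combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: the chaud, s'il vous plait
        System.out.println(removeCombiningMarks("th\u00e9 chaud, s'il vous pla\u00eet"));
    }
}
```

Letters that have no decomposition, such as ß or ø, pass through unchanged and would still need an explicit mapping.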

devio 2009-06-19 10:03:29

Michael Borgwardt mentioned stripping out Character.MODIFIER_LETTER chars. Which one is it, or did you basically mean the same thing?

Jonik 2009-06-19 10:39:01

+1 Interesting!

starblue 2009-06-19 10:54:57

Formally, Unicode category Lm, to which Character.MODIFIER_LETTER corresponds. That's clearly what's needed here: http://www.dpawson.co.uk/xsl/rev2/UnicodeCategories.html. Category Mc "Mark, spacing combining" only seems to apply to certain Asian languages.

Michael Borgwardt 2009-06-19 11:24:20

I meant Unicode characters whose name contains "COMBINING". This seems to be the Mn category (Mark, non-spacing), as per Michael's link.

devio 2009-06-19 11:47:31

Jonik/Michael: Just removing Lm won't work for combined letters like "Æ". You have to do a "KD" normalization before removing Lm.

J-16 SDiZ 2009-06-19 12:00:19
