We have a system where customers, mainly European, enter text (in UTF-8) that has to be distributed to several other systems. Most of them accept UTF-8, but we must now also distribute the text to a US system that only accepts 7-bit US-ASCII.

So now we need to translate all European characters to the nearest US-ASCII equivalent. Are there any Java libraries to help with this task?

Right now we've just started adding to a translation table, where Å (Swedish AA) -> A and so on. When we don't find a match for an entered character, we log it, replace it with a question mark, and try to fix it for the next release. That seems very inefficient, and somebody else must have done something similar before.

A: 

There are some built-in functions to do this. The main class involved is CharsetEncoder, which is part of the java.nio.charset package. A simpler way is String.getBytes(Charset), whose result can be written to a ByteArrayOutputStream.
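As a rough sketch of what the built-in encoding gives you (the class and method names here are made up for illustration): String.getBytes(Charset) replaces every character that US-ASCII cannot represent with '?', and CharsetEncoder.canEncode can be used to detect such input beforehand. Neither call maps an accented letter to its base letter.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class AsciiEncodingSketch {

    /** Encodes to US-ASCII bytes; each unmappable character becomes '?'. */
    static byte[] toAsciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns true if every character of s can be encoded as US-ASCII. */
    static boolean isPureAscii(String s) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        String input = "caf\u00e9"; // "café"
        System.out.println(isPureAscii(input)); // prints false
        // prints caf?
        System.out.println(new String(toAsciiBytes(input), StandardCharsets.US_ASCII));
    }
}
```

This is probably only useful as a last safety net, or as a check that whatever transliteration you apply first really produced pure ASCII.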

sblundy
This doesn't address the normalization from 'é' to 'e'.
Joe Liversedge
+10  A: 

The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).

Be aware that there are no universally accepted approximations: Germans want you to replace Ä by AE, Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might like the historically more correct AA better.
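If you do end up with your own table, one way to cope with these differences is to keep a default table plus per-language overrides. The following is only a sketch under that assumption: the class name, the language codes used as keys, and the handful of entries are illustrative, not taken from uni2ascii.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table layout: defaults plus per-language overrides.
final class LocaleAwareTable {
    private static final Map<Character, String> DEFAULTS = new HashMap<>();
    private static final Map<String, Map<Character, String>> OVERRIDES = new HashMap<>();

    static {
        DEFAULTS.put('\u00c4', "A"); // Ä, as Finns and Swedes would write it
        DEFAULTS.put('\u00c5', "A"); // Å, ring simply dropped

        Map<Character, String> german = new HashMap<>();
        german.put('\u00c4', "AE"); // Ä -> AE
        OVERRIDES.put("de", german);

        Map<Character, String> danish = new HashMap<>();
        danish.put('\u00c5', "AA"); // Å -> AA
        OVERRIDES.put("da", danish);
    }

    /** Transliterates s, preferring the override table for the given language code. */
    static String transliterate(String s, String language) {
        Map<Character, String> overrides = OVERRIDES.get(language);
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 128) {
                sb.append(ch); // already ASCII
            } else if (overrides != null && overrides.containsKey(ch)) {
                sb.append(overrides.get(ch));
            } else if (DEFAULTS.containsKey(ch)) {
                sb.append(DEFAULTS.get(ch));
            } else {
                sb.append('?'); // unknown character: candidate for logging
            }
        }
        return sb.toString();
    }
}
```

With these entries, transliterate("\u00c4rger", "de") gives "AErger", while the same input with "sv" gives "Arger".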

Jouni K. Seppänen
+5  A: 

Instead of creating your own table, you could convert the text to normalization form D (NFD), where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
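A minimal sketch of the difference, using java.text.Normalizer (available since Java 6). The class and method names are made up, and for simplicity it strips every non-ASCII character rather than every non-letter, so spaces and digits survive:

```java
import java.text.Normalizer;

// Hypothetical helper; only Normalizer is a real JDK class here.
final class NormalizeSketch {

    /** Decomposes s with the given form, then drops everything outside ASCII. */
    static String toAscii(String s, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(s, form);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String input = "\ufb01anc\u00e9"; // "ﬁancé", starting with the "fi" ligature U+FB01
        System.out.println(toAscii(input, Normalizer.Form.NFD));  // prints ance
        System.out.println(toAscii(input, Normalizer.Form.NFKD)); // prints fiance
    }
}
```

NFD leaves the ligature alone, so it is stripped entirely, whereas NFKD first expands it into the two ASCII letters.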

CesarB
Related answer: http://stackoverflow.com/questions/225471/how-do-i-replace-accented-latin-characters-in-ruby#226090
CesarB
A: 

This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but does the trick.

Joe Liversedge
A: 

This is what seems to work:

private static String utftoasci(String s) {
    // ß expands to two characters, so reserve some extra room.
    final StringBuilder sb = new StringBuilder(s.length() * 2);
    for (int i = 0; i < s.length(); i++) {
        final char ch = s.charAt(i);
        if (ch < 128) {
            // Already 7-bit ASCII (letters, digits, spaces, punctuation): copy unchanged.
            sb.append(ch);
            continue;
        }
        switch (ch) {
            case 'Â': case 'Ä': case 'Å': sb.append('A'); break;
            case 'á': case 'ä': sb.append('a'); break;
            case 'Ç': sb.append('C'); break;
            case 'È': case 'É': case 'Ê': sb.append('E'); break;
            case 'è': case 'é': case 'ë': sb.append('e'); break;
            case 'Ñ': sb.append('N'); break;
            case 'Ó': case 'Ö': sb.append('O'); break;
            case 'ö': sb.append('o'); break;
            case 'Ü': sb.append('U'); break;
            case 'ü': sb.append('u'); break;
            case 'ß': sb.append("ss"); break;
            case 'ª': case 'º': break; // ordinal indicators are dropped
            default: sb.append('?'); // no mapping known
        }
    }
    return sb.toString();
}

Rob
+2  A: 

You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):

public static String decompose(String s) {
    // Decompose to base letters plus combining marks, then remove the marks.
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
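Note that some characters (ß and Ø, for example) have no Unicode decomposition, so they come through this unchanged. A possible way to handle them, sketched here with a made-up class name and a deliberately tiny fallback table, is to run the decomposition first and then apply a small hand-written table to whatever is still non-ASCII:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical combination of NFD stripping and a manual fallback table.
final class DecomposeWithFallback {
    // Illustrative entries only; a real table would need more characters.
    private static final Map<Character, String> FALLBACK = new HashMap<>();

    static {
        FALLBACK.put('\u00df', "ss"); // ß
        FALLBACK.put('\u00d8', "O");  // Ø
        FALLBACK.put('\u00f8', "o");  // ø
    }

    static String toAscii(String s) {
        // Step 1: let the Unicode tables handle everything that decomposes.
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: map the remaining non-ASCII characters by hand, '?' if unknown.
        StringBuilder sb = new StringBuilder(stripped.length() * 2);
        for (int i = 0; i < stripped.length(); i++) {
            char ch = stripped.charAt(i);
            if (ch < 128) {
                sb.append(ch);
            } else {
                String replacement = FALLBACK.get(ch);
                sb.append(replacement != null ? replacement : "?");
            }
        }
        return sb.toString();
    }
}
```

For example, toAscii("Stra\u00dfe") returns "Strasse" and toAscii("\u00c5ngstr\u00f6m") returns "Angstrom"; anything in neither table ends up as '?'.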
Simon Lieschke