ansaurus

Question

Answer 1

+4 A:

The latest version of Unicode contains over 107000 characters. A 256-entry translation array won't cut it.

That said, you can get the code point at an index in a string using the String.codePointAt(int index) method.

You might also want to use Character.isWhitespace(int codePoint), Character.isDigit(int codePoint) and so on.

See also http://download.oracle.com/javase/6/docs/api/java/lang/String.html and http://download.oracle.com/javase/6/docs/api/java/lang/Character.html
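
A minimal sketch of those calls, assuming a hypothetical helper named classify in a made-up class CodePointDemo. It reads a whole code point with String.codePointAt and tests it with the int overloads of the Character methods:

```java
class CodePointDemo {
    // Hypothetical helper: classifies the code point that starts at the given char index.
    static String classify(String s, int index) {
        int cp = s.codePointAt(index); // a surrogate pair is read as one code point
        if (Character.isDigit(cp)) return "digit";
        if (Character.isWhitespace(cp)) return "whitespace";
        if (Character.isLetter(cp)) return "letter";
        return "other";
    }

    public static void main(String[] args) {
        // U+1D7D8 (MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO) lies outside the BMP,
        // so it occupies two char values in the string.
        String s = "a\uD835\uDFD8";
        System.out.println(classify(s, 0));                 // letter
        System.out.println(classify(s, 1));                 // digit
        System.out.println(Character.isDigit(s.charAt(1))); // false: charAt sees only half the pair
    }
}
```

The last line shows why the char-based overloads are not enough once characters outside the BMP turn up.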

Christoffer Hammarström 2010-10-04 14:56:55

It's chilling to see how many people who understand Unicode quite well still advocate using charAt() - if characters outside the BMP ever see mainstream use, most Java code out there will break in subtle ways; feels like 1995 all over again, when your chances of getting non-ASCII characters sent via email and displayed correctly using software intended to be used by English speakers were something like 10%.

Michael Borgwardt 2010-10-04 15:25:33

@Michael: Agreed, the API is outdated. It's hard enough to understand this stuff without the API misleading you.

Christoffer Hammarström 2010-10-04 15:36:53

Answer 2

+1 A:

As Christoffer says, with Unicode characters a 256-element array is not enough.

One way is to use a HashMap<Character,String> mapping each Character to the desired translated value, and use String.charAt() to extract each character in turn. You might also look at some of the methods on the Character class like isDigit() and isLetter() to do some of the work; that might be easier than constructing a mapping for every "letter" (in multiple languages, perhaps).

By using a HashMap, you only need to define mappings for the characters you wish to translate. For characters that have no mapping (HashMap.get() returns null), you could either specify a default value or pass them through unchanged.
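
A rough sketch of that idea, under the assumption that unmapped characters are passed through unchanged. The class name CharMapTranslator and the three sample mappings are made up for illustration, and because it uses charAt() it only covers characters inside the BMP:

```java
import java.util.HashMap;
import java.util.Map;

class CharMapTranslator {
    // Sparse table: only the characters that need translating are listed.
    static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00DF', "ss"); // sharp s
        TABLE.put('\u0153', "oe"); // oe ligature
        TABLE.put('\u00E5', "a");  // a with ring above
    }

    static String translate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = TABLE.get(c); // null when there is no mapping
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c); // pass through unchanged
            }
        }
        return out.toString();
    }
}
```

A StringBuilder is used rather than a char[] because a replacement may be longer than the character it replaces.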

David Gelhar 2010-10-04 15:01:39

Answer 3

+4 A:

This is a bit off topic, but if you want to do a comprehensive job of character translation, you cannot simply use String.charAt(int). Unicode codepoints larger than 65535 are represented in Java Strings as two consecutive char values.

The clean way to deal with this is to use String.codePointAt(int) to extract each code point, and String.offsetByCodePoints(int, int) to step through the code point positions.
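
As a sketch of that loop (the class and method names here are hypothetical), the following collects every code point of a string, advancing by one code point per step rather than one char:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalker {
    // Returns the code points of s in order, treating each surrogate pair as one entry.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        // offsetByCodePoints(i, 1) yields the char index one code point after i.
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            result.add(s.codePointAt(i));
        }
        return result;
    }
}
```

For the string "a\uD835\uDFD8b" this yields three code points even though length() reports four char values.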

Stephen C 2010-10-04 15:17:04

D'oh, you're right. I've updated my answer.

Christoffer Hammarström 2010-10-04 15:21:14

Answer 4

A:

There are different ways to answer this question. The easiest way is probably to come up with answers for each of the problems individually:

Problem 1:

e.g.1 I want to map a string with "A" for alphabetic characters, "N" for numeric characters, "B" for space characters and "X" for anything else. Thus "SL5 3QW" becomes "AANBNAA".

Simple solution:

public static String map(final String input){
    final char[] out = new char[input.length()];
    for(int i = 0; i < input.length(); i++){
        final char c = input.charAt(i);
        final char t;
        if(Character.isDigit(c)){
            t = 'N';
        } else if(Character.isWhitespace(c)){
            t = 'B';
        } else if(Character.isLetter(c)){
            t = 'A';
        } else{
            t = 'X';
        }
        out[i] = t;
    }
    return new String(out);
}

Test:

public static void main(final String[] args){
    System.out.println(map("SL5 3QW"));
}

Output:

AANBNAA

Problem 2:

e.g.2. I want to translate some characters, such as "œ" (x'9D') to "oe" (x'6F65'), "ß" to "ss", "å" to "a", etc.

Solution:

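This is standard functionality: use the java.text.Normalizer API to decompose accented characters such as "å" and then strip the combining marks. Characters such as "œ" and "ß" have no Unicode decomposition, so they still need explicit replacements. See these previous answers for reference.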
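
A minimal sketch of the Normalizer approach, assuming a hypothetical helper named fold. Normalizer decomposes accented letters but leaves "œ" and "ß" unchanged, so this sketch replaces those two explicitly before normalizing:

```java
import java.text.Normalizer;

class AccentFolder {
    static String fold(String input) {
        // Assumption: these two are handled by hand because they have no decomposition.
        String expanded = input.replace("\u0153", "oe").replace("\u00DF", "ss");
        // NFD splits e.g. U+00E5 into 'a' followed by U+030A COMBINING RING ABOVE.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        // Drop the combining marks, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Any further special cases would have to be added to the explicit replacements in the same way.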

The Big Picture

But on second thought there is of course a more general solution to your problem. Let's see how many downvotes I get for this one by the if/else lovers. Define an interface of a transformer that accepts certain characters and / or character classes and maps them to other characters:

public interface CharTransformer{
    boolean supports(char input);
    char transform(char input);
}

And now define a method that you can call with a string and a collection of such transformers. For every single character, each transformer is queried in turn to see whether it supports this character. The first one that does performs the transformation. If no transformer is found for a character, an exception is thrown.

public static String mapWithTransformers(final String input,
    final Collection<? extends CharTransformer> transformers){
    final char[] out = new char[input.length()];
    for(int i = 0; i < input.length(); i++){
        final char c = input.charAt(i);
        char t = 0;
        boolean matched = false;
        for(final CharTransformer tr : transformers){
            if(tr.supports(c)){
                matched = true;
                t = tr.transform(c);
                break;
            }
        }
        if(!matched){
            throw new IllegalArgumentException("Found no Transformer for char: "
                + c);
        }
        out[i] = t;
    }
    return new String(out);
}
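
To show how such transformers might be wired together, here is a self-contained sketch that repeats the interface and a condensed version of the method and then rebuilds the Problem 1 mapping. The fixed() factory, the four constants and the class name TransformerDemo are hypothetical and not part of the answer above; the sketch assumes Java 8 for the lambdas.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.IntPredicate;

class TransformerDemo {
    interface CharTransformer {
        boolean supports(char input);
        char transform(char input);
    }

    // Hypothetical factory: maps every char accepted by 'test' to one fixed result.
    static CharTransformer fixed(final IntPredicate test, final char result) {
        return new CharTransformer() {
            public boolean supports(char input) { return test.test(input); }
            public char transform(char input) { return result; }
        };
    }

    static final CharTransformer DIGITS = fixed(Character::isDigit, 'N');
    static final CharTransformer WHITESPACE = fixed(Character::isWhitespace, 'B');
    static final CharTransformer LETTERS = fixed(Character::isLetter, 'A');
    static final CharTransformer FALLBACK = fixed(c -> true, 'X'); // accepts everything

    static String mapWithTransformers(String input,
            Collection<? extends CharTransformer> transformers) {
        char[] out = new char[input.length()];
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            CharTransformer match = null;
            for (CharTransformer tr : transformers) {
                if (tr.supports(c)) { match = tr; break; } // first match wins
            }
            if (match == null) {
                throw new IllegalArgumentException("Found no Transformer for char: " + c);
            }
            out[i] = match.transform(c);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        List<CharTransformer> chain = Arrays.asList(DIGITS, WHITESPACE, LETTERS, FALLBACK);
        System.out.println(mapWithTransformers("SL5 3QW", chain)); // AANBNAA
    }
}
```

Order matters because the first supporting transformer wins, so a catch-all such as FALLBACK has to come last. Leaving it out makes unsupported characters raise the IllegalArgumentException instead.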

One more thing: Maps

Note: Others have suggested using a Map. While I don't think a standard map is good for this task, you could use Guava's MapMaker.makeComputingMap(function) to calculate the replacements as needed (and automatically cache them). That way you have a lazily initialized caching map.
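
The same lazily initialized caching idea can be sketched without Guava. The example below swaps in the JDK's ConcurrentHashMap.computeIfAbsent (Java 8) in place of makeComputingMap, which may not be available in the Guava version you use. The class name and the replacement function are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class LazyReplacementCache {
    private final Map<Character, String> cache = new ConcurrentHashMap<>();
    private final Function<Character, String> compute;

    // 'compute' is called at most once per distinct character and must not return null.
    LazyReplacementCache(Function<Character, String> compute) {
        this.compute = compute;
    }

    String replacementFor(char c) {
        // Computes the replacement on first request, then serves it from the map.
        return cache.computeIfAbsent(c, compute);
    }
}
```

A caller would construct it with whatever (possibly expensive) replacement logic applies, for example new LazyReplacementCache(c -> c == '\u00DF' ? "ss" : String.valueOf(c)).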

seanizer 2010-10-04 16:06:20

Splendid! Thanks. A cornucopia of things to try. I've clearly a lot to learn!

Steve 2010-10-05 06:51:57

You're welcome. Hint: The common way to say thank you here is to click the up arrow next to the question, or if you really like it, the checkmark.

seanizer 2010-10-05 06:53:41

Answer 5

+2 A:

HashMap<String, String> should work just fine. No need to over-engineer such a simple problem.
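
One possible shape for this, as a sketch with made-up names: keying the map by String rather than Character lets a key hold a full code point, so characters outside the BMP can be mapped too. Unmapped characters are assumed to pass through unchanged, and Map.getOrDefault requires Java 8.

```java
import java.util.Map;

class StringMapTranslator {
    static String translate(String input, Map<String, String> table) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int len = Character.charCount(cp);        // 1 for BMP, 2 for a surrogate pair
            String key = input.substring(i, i + len); // one code point as a String
            out.append(table.getOrDefault(key, key)); // pass through when unmapped
            i += len;
        }
        return out.toString();
    }
}
```

The table itself can be an ordinary HashMap<String, String> filled with put("\u00DF", "ss") and similar entries, or loaded from a file so that nothing is hardcoded.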

Kdansky 2010-10-04 16:18:57

+1 - this is easy to define, doesn't need to be hardcoded, and can handle sparse translations ... at the cost of extra heap, of course.

kdgregory 2010-10-04 16:39:13

Yes; it is neither the fastest nor the smallest solution (memory-wise), but for all standard cases it will do incredibly well and be ridiculously easy to change and maintain. That alone makes it worth using until it becomes a bottleneck, and only then should you look for a more complex solution. Honestly, it should be your benchmark.

Kdansky 2010-10-04 17:55:29

How do I translate strings using Java?
