I want a translation routine that allows me to translate any character to any other character or set of characters efficiently. The obvious way seems to be to use the value of a character from the input string as an index into a 256-entry translation array.

Given an initial array where each entry is set to its value, e.g. hex'37' would appear in the 56th entry (allowing 00 to be the first), the user could then substitute any characters required in the translate string.

e.g.1 I want to map a string with "A" for alphabetic characters, "N" for numeric characters, "B" for space characters and "X" for anything else. Thus "SL5 3QW" becomes "AANBNAA".

e.g.2. I want to translate some characters, such as "œ" (x'9D') to "oe" (x'6F65'), "ß" to "ss", "å" to "a", etc.

How do I get a numeric value from a character in the input string to use it as an index into the translate array?

It's easy with function CODE in Excel and straightforward in IBM assembler, but I can't track down a method in Java.

+4  A: 

The latest version of Unicode contains over 107000 characters. A 256-entry translation array won't cut it.

That said, you can get the code point at an index in a string using the String.codePointAt(int index) method.

You might also want to use Character.isWhitespace(int codepoint) and Character.isDigit(int codepoint) and so on.

See also http://download.oracle.com/javase/6/docs/api/java/lang/String.html and http://download.oracle.com/javase/6/docs/api/java/lang/Character.html
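As a rough sketch of how those methods fit together, the class below classifies a string one code point at a time rather than one char at a time. The class and method names (CodePointClassifier, classify) are made up for illustration; only the String and Character calls come from the JDK.

```java
final class CodePointClassifier {

    // N = digit, B = whitespace, A = letter, X = anything else.
    static String classify(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);      // full code point, even outside the BMP
            if (Character.isDigit(cp)) {
                out.append('N');
            } else if (Character.isWhitespace(cp)) {
                out.append('B');
            } else if (Character.isLetter(cp)) {
                out.append('A');
            } else {
                out.append('X');
            }
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(classify("SL5 3QW")); // prints AANBNAA
    }
}
```

Because the loop advances by Character.charCount(cp), a character stored as a surrogate pair produces one output letter instead of two.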

Christoffer Hammarström
It's chilling to see how many people who understand Unicode quite well still advocate using charAt() - if characters outside the BMP ever see mainstream use, most Java code out there will break in subtle ways; feels like 1995 all over again, when your chances of getting non-ASCII characters sent via email and displayed correctly using software intended to be used by English speakers were something like 10%.
Michael Borgwardt
@Michael: Agreed, the API is outdated. It's hard enough to understand this stuff without the API misleading you.
Christoffer Hammarström
+1  A: 

As Christoffer says, with Unicode characters a 256-element array is not enough.

One way is to use a HashMap<Character,String> mapping each Character to the desired translated value, and use String.charAt() to extract each character in turn. You might also look at some of the methods on the Character class like isDigit() and isLetter() to do some of the work; that might be easier than constructing a mapping for every "letter" (in multiple languages, perhaps).

By using a HashMap, you only need to define mappings for the characters you wish to translate. For characters that have no mapping (HashMap.get() returns null), you could either substitute a default value or pass them through unchanged.
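A minimal sketch of that idea, assuming unmapped characters are passed through unchanged. MapTranslator, translate and sampleTable are illustrative names, not an existing API, and because it uses charAt() it only handles characters inside the BMP.

```java
import java.util.HashMap;
import java.util.Map;

final class MapTranslator {

    // Hypothetical sample table covering the characters from the question.
    static Map<Character, String> sampleTable() {
        Map<Character, String> table = new HashMap<>();
        table.put('\u0153', "oe"); // œ
        table.put('\u00DF', "ss"); // ß
        table.put('\u00E5', "a");  // å
        return table;
    }

    static String translate(String input, Map<Character, String> table) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = table.get(c); // null when no mapping exists
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);                 // pass through unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("Stra\u00DFe", sampleTable())); // prints Strasse
    }
}
```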

David Gelhar
+4  A: 

This is a bit off topic, but if you want to do a comprehensive job of character translation, you cannot simply use String.charAt(int). Unicode codepoints larger than 65535 are represented in Java Strings as two consecutive char values.

The clean way to deal with this is to use String.codePointAt(int) to extract each code point, and String.offsetByCodePoints(int, int) to step through the code point positions.
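The stepping described here might look roughly like the following. CodePointWalker and codePoints are hypothetical names chosen for the example; the sketch simply collects the code points so the stepping logic is easy to see.

```java
final class CodePointWalker {

    // Returns the code points of the input in order.
    static int[] codePoints(String input) {
        int[] result = new int[input.codePointCount(0, input.length())];
        int n = 0;
        // offsetByCodePoints(i, 1) yields the index of the next code point,
        // skipping the second half of a surrogate pair when there is one.
        for (int i = 0; i < input.length(); i = input.offsetByCodePoints(i, 1)) {
            result[n++] = input.codePointAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", U+1F600 (stored as two chars), "b"
        for (int cp : codePoints("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

The sample string is four chars long but yields three code points, which is exactly the case where a charAt() loop would go wrong.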

Stephen C
D'oh, you're right. I've updated my answer.
Christoffer Hammarström
A: 

There are different ways to answer this question. The easiest way is probably to come up with answers for each of the problems individually:


Problem 1:

e.g.1 I want to map a string with "A" for alphabetic characters, "N" for numeric characters, "B" for space characters and "X" for anything else. Thus "SL5 3QW" becomes "AANBNAA".

Simple solution:

public static String map(final String input){
    final char[] out = new char[input.length()];
    for(int i = 0; i < input.length(); i++){
        final char c = input.charAt(i);
        final char t;
        if(Character.isDigit(c)){
            t = 'N';
        } else if(Character.isWhitespace(c)){
            t = 'B';
        } else if(Character.isLetter(c)){
            t = 'A';
        } else{
            t = 'X';
        }
        out[i] = t;
    }
    return new String(out);
}

Test:

public static void main(final String[] args){
    System.out.println(map("SL5 3QW"));
}

Output:

AANBNAA


Problem 2:

e.g.2. I want to translate some characters, such as "œ" (x'9D') to "oe" (x'6F65'), "ß" to "ss", "å" to "a", etc.

Solution:

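Part of this is standard functionality: java.text.Normalizer can decompose accented characters such as "å" into a base letter plus a combining mark, which you can then strip. Characters that have no Unicode decomposition, such as "œ" and "ß", still need an explicit mapping. See these previous answers for reference.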
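A hedged sketch of what that could look like with java.text.Normalizer. It rests on the assumption that stripping combining marks after NFD decomposition is acceptable for your data, and it adds a small hand-written table for characters such as "œ" and "ß", which have no Unicode decomposition and are therefore left untouched by normalization. AccentFolder, fold and EXTRA are illustrative names.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AccentFolder {

    // Explicit replacements for characters that normalization does not decompose.
    private static final Map<Character, String> EXTRA = new HashMap<>();
    static {
        EXTRA.put('\u0153', "oe"); // œ
        EXTRA.put('\u0152', "OE"); // Œ
        EXTRA.put('\u00DF', "ss"); // ß
    }

    static String fold(String input) {
        // NFD splits "å" into "a" followed by U+030A COMBINING RING ABOVE.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = EXTRA.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00E5"));      // prints a
        System.out.println(fold("\u0153uvre"));  // prints oeuvre
    }
}
```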


The Big Picture

But on second thought there is of course a more general solution to your problem. Let's see how many downvotes I get for this one by the if/else lovers. Define an interface of a transformer that accepts certain characters and / or character classes and maps them to other characters:

public interface CharTransformer{
    boolean supports(char input);
    char transform(char input);
}

And now define a method that you can call with a string and a collection of such transformers. For every single character, each transformer is queried in turn to see whether it supports that character. The first one that does performs the transformation. If no transformer is found for a character, an exception is thrown.

public static String mapWithTransformers(final String input,
    final Collection<? extends CharTransformer> transformers){
    final char[] out = new char[input.length()];
    for(int i = 0; i < input.length(); i++){
        final char c = input.charAt(i);
        char t = 0;
        boolean matched = false;
        for(final CharTransformer tr : transformers){
            if(tr.supports(c)){
                matched = true;
                t = tr.transform(c);
                break;
            }
        }
        if(!matched){
            throw new IllegalArgumentException("Found no Transformer for char: "
                + c);
        }
        out[i] = t;
    }
    return new String(out);
}
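To show how such transformers might be wired up for the first example, here is a self-contained sketch. It repeats the interface so that it compiles on its own; the fixed() factory, the POSTCODE_RULES list and the apply() method are hypothetical helpers invented for this illustration, and the catch-all "X" rule at the end is an assumption that replaces the exception for unmatched characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

final class TransformerDemo {

    interface CharTransformer {
        boolean supports(char input);
        char transform(char input);
    }

    // Hypothetical helper: every char accepted by the predicate maps to one fixed result.
    static CharTransformer fixed(final IntPredicate accepts, final char result) {
        return new CharTransformer() {
            @Override public boolean supports(char input) { return accepts.test(input); }
            @Override public char transform(char input) { return result; }
        };
    }

    // Order matters: the first transformer that supports a char wins.
    static final List<CharTransformer> POSTCODE_RULES = Arrays.asList(
        fixed(Character::isDigit, 'N'),
        fixed(Character::isWhitespace, 'B'),
        fixed(Character::isLetter, 'A'),
        fixed(c -> true, 'X')); // catch-all, so no char is left unmatched

    static char transformOne(char c, List<? extends CharTransformer> transformers) {
        for (CharTransformer tr : transformers) {
            if (tr.supports(c)) {
                return tr.transform(c);
            }
        }
        throw new IllegalArgumentException("Found no Transformer for char: " + c);
    }

    static String apply(String input, List<? extends CharTransformer> transformers) {
        char[] out = new char[input.length()];
        for (int i = 0; i < input.length(); i++) {
            out[i] = transformOne(input.charAt(i), transformers);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(apply("SL5 3QW", POSTCODE_RULES)); // prints AANBNAA
    }
}
```

Note that a char-to-char interface like this cannot express one-to-many replacements such as "ß" to "ss"; for those, transform() would need to return a String.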

One more thing: Maps

Note: Others have suggested using a Map. While I don't think a standard map is good for this task, you could use Guava's MapMaker.makeComputingMap(function) (superseded in later Guava releases by CacheBuilder with a CacheLoader) to calculate the replacements as needed and cache them automatically. That way you have a lazily initialized caching map.
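The same lazy-caching idea can be sketched without Guava. The example below swaps in the JDK's ConcurrentHashMap.computeIfAbsent (Java 8 or later) in place of a Guava computing map, so it illustrates the concept rather than Guava's API. CachingTranslator and cachedEntries are illustrative names, and the rule function is assumed never to return null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CachingTranslator {

    private final Map<Character, String> cache = new ConcurrentHashMap<>();
    private final Function<Character, String> rule;

    // The rule computes the replacement for a char; it must not return null.
    CachingTranslator(Function<Character, String> rule) {
        this.rule = rule;
    }

    String translate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            // The rule runs only the first time a given char is seen;
            // later lookups are served from the cache.
            out.append(cache.computeIfAbsent(input.charAt(i), rule));
        }
        return out.toString();
    }

    int cachedEntries() {
        return cache.size();
    }

    public static void main(String[] args) {
        CachingTranslator t = new CachingTranslator(
            c -> Character.isDigit(c) ? "N" : Character.isLetter(c) ? "A" : "X");
        System.out.println(t.translate("SL5-3QW")); // prints AANXNAA
    }
}
```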

seanizer
Splendid! Thanks. A cornucopia of things to try. I've clearly a lot to learn!
Steve
You're welcome. Hint: The common way to say thank you here is to click the up arrow next to the question, or if you really like it, the checkmark.
seanizer
+2  A: 

HashMap<String, String> should work just fine. No need to over-engineer such a simple problem.

Kdansky
+1 - this is easy to define, doesn't need to be hardcoded, and can handle sparse translations ... at the cost of extra heap, of course.
kdgregory
Yes; it is neither the fastest nor the smallest solution (memory-wise), but for all standard cases it will do incredibly well and be ridiculously easy to change and maintain. That alone makes it worth using until it becomes a bottleneck, and only then should you look for a more complex solution. Honestly, it should be your benchmark.
Kdansky