views:

404

answers:

5

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding would only be known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be measured individually. So far I've come up with this:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;

But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find any other, more efficient way in the Java API. There's a String#valueOf(char), but according to its source it does basically the same as the above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)

If you question the need for this, check this topic.


Update: the answer from @Bkkbrad is technically the most efficient:

char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();

However, as @Stephen C pointed out, there are more problems with this. There may for example be combining/surrogate characters which need to be taken into account as well. But that's a separate problem which needs to be solved in the step before this one.
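For that preceding step, a minimal sketch (the string literal here is just an illustrative input) is to walk the input by code point rather than by char, so surrogate pairs are measured as one unit instead of two broken halves:

```java
// Sketch: split the input into complete code points first, so a surrogate
// pair is handled as a single unit before its byte length is measured.
String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'
int i = 0;
while (i < s.length()) {
    int cp = s.codePointAt(i);    // the full code point, possibly > 0xFFFF
    // ... measure the byte length of this single code point here ...
    i += Character.charCount(cp); // advances by 1 or 2 chars
}
```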

+2  A: 

It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single character String is therefore not the whole answer.

(For example, you could theoretically receive Baudot / teletype characters encoded as 4 characters per 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it is all a bit implausible, but ...)

Stephen C
Yes, good point; surrogate characters do indeed have to be taken into account sooner or later.
BalusC
+2  A: 

If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found the start of another character. Since a UTF-8 encoded code point is at most 4 bytes (6 under the original, pre-RFC 3629 definition), you can copy the intermediate bytes into a fixed-length buffer.
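The backward search described above can be sketched as follows (the method name `findSequenceStart` is hypothetical; UTF-8 continuation bytes all have the bit pattern `10xxxxxx`):

```java
// Sketch: find the start of the UTF-8 sequence that contains byte index i,
// assuming the array holds well-formed UTF-8. Continuation bytes match
// 10xxxxxx, i.e. (b & 0xC0) == 0x80, so we step back past them.
static int findSequenceStart(byte[] utf8, int i) {
    while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
        i--;
    }
    return i;
}
```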

Edit: I forgot to mention: even if you don't go with this strategy, it is not sufficient to use a Java "char" to store arbitrary code points, since code point values can exceed 0xffff. You need to store code points in an "int".
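To illustrate the point about char versus int (U+1F600 here is just an example of a supplementary-plane code point):

```java
// Sketch: a char holds only one UTF-16 code unit, so any code point above
// 0xFFFF needs an int, and takes two chars (a surrogate pair) in UTF-16.
int cp = 0x1F600;                     // does not fit in a single char
char[] units = Character.toChars(cp); // two UTF-16 code units
int back = Character.toCodePoint(units[0], units[1]); // reassembles to 0x1F600
```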

bkail
Very good tip. Unfortunately there's likely no 100% guarantee.
BalusC
@bkail: +1 to you: you're the only one in this thread to mention that a Java *char* cannot store arbitrary code points and that an *int* should be used instead.
Webinator
A: 

Try `Charset.forName("UTF-8").encode("string").limit();`. It might be a bit more efficient, maybe not.

Chris Dennett
This still requires a `String` as input.
BalusC
A: 

It would be very easy to create an efficient hashtable of all the characters in UTF-8 that maps them to their length.

Basically, run your above code statically, walking through all the character combinations and putting them in a hashtable. (You do this only on startup.)

You'll want to use Trove and not the Java collections.

This would be the fastest, but would obviously use a pretty big chunk of memory.

EDIT: This is probably a dumb idea, as the number of Unicode characters is extremely large.
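For UTF-8 specifically, a table isn't needed at all: the encoded length follows directly from the code point's numeric range. A sketch (the method name `utf8Length` is hypothetical, and it assumes a valid code point, not an unpaired surrogate):

```java
// Sketch: UTF-8 byte length of a single valid code point, by range check.
static int utf8Length(int codePoint) {
    if (codePoint < 0x80)    return 1; // ASCII
    if (codePoint < 0x800)   return 2;
    if (codePoint < 0x10000) return 3; // rest of the Basic Multilingual Plane
    return 4;                          // supplementary planes
}
```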

Adam Gent
+1  A: 

Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000,000 single characters:

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while (input.position() < limit) {
    output.clear();
    input.mark();
    // Restrict the view to at most one surrogate pair (2 chars).
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get())
            && (!input.hasRemaining() || !Character.isLowSurrogate(input.get()))) {
        // Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
Bkkbrad
Technically, this is the best answer so far (if you replace `position()` with `limit()`). This is indeed much more efficient.
BalusC
@Bkkbrad: A Java char has been totally inadequate to represent a Unicode character since 1993 or so, when Unicode moved to 1.1 and gained more than 65,536 codepoints. The method to use to get a character in Java is String's *codePointAt(..)*, which correctly returns an *int*. Java *char* is, well, utterly broken. (200 KLOC codebase here, and we're using Java char, well... **zero** times.)
Webinator
@WizardOfOdds: I added a solution to keep track of surrogate pairs.
Bkkbrad