I have a byte offset into a byte array containing a UTF-8 encoded string. How can I transform that into a char offset for the corresponding Java String?

NB. this question used to read:

I have a byte offset into a standard Java String, and I would like to convert that to a character offset.

In practice this will mean a method like charOffsetBefore(int byteOffset) since any byte offset could be in the middle of a code point.

Thanks.

+3  A: 

Please be extremely careful with your terminology, otherwise you'll get confused. There is no such thing as a "byte offset into a Java string". Java strings are made up of 16-bit characters (UTF-16 code units).

So I assume that you have a byte array and an offset and you want to convert that into a Java string and still preserve locations (so you can map back and forth).

This depends on the encoding of the byte array. If it's UTF-8, then any byte that has its MSB set is part of a multi-byte encoding sequence. Search backwards for the byte for which (byte & 0xc0) == 0xc0. That's the start of the encoding sequence (see the Wikipedia article).
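As a rough illustration of the idea, here is a minimal sketch of a charOffsetBefore method like the one the question asks for. It assumes the array holds well-formed UTF-8 and that the String was decoded from exactly those bytes. The class name Utf8Offsets and the helper codePointStart are made up for this example. It skips back over continuation bytes (those with (b & 0xC0) == 0x80), then counts how many Java chars the preceding bytes decode to.

```java
final class Utf8Offsets {
    // Back up from byteOffset to the first byte of the sequence containing it.
    // Continuation bytes have the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    static int codePointStart(byte[] utf8, int byteOffset) {
        int i = byteOffset;
        while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
            i--;
        }
        return i;
    }

    // Char offset (UTF-16 code units) of the code point that contains byteOffset.
    // Assumption: utf8 is well-formed UTF-8; no validation is attempted.
    static int charOffsetBefore(byte[] utf8, int byteOffset) {
        int end = byteOffset >= utf8.length
                ? utf8.length
                : codePointStart(utf8, byteOffset);
        int chars = 0;
        for (int i = 0; i < end; i++) {
            int b = utf8[i] & 0xFF;
            if ((b & 0xC0) == 0x80) {
                continue; // continuation byte, already counted with its lead byte
            }
            // A 4-byte sequence (lead byte 11110xxx) becomes a surrogate pair,
            // which is two chars in a Java String; everything else is one char.
            chars += (b >= 0xF0) ? 2 : 1;
        }
        return chars;
    }
}
```

This scans from the start of the array on every call, so if you need many lookups on a large buffer it may be worth building an index once instead.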

If you're asking about offsets into the String's chars, then the encoding is UTF-16 and you need to look for surrogate pairs, since a code point outside the Basic Multilingual Plane takes two chars.
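For the UTF-16 case, a small sketch of what looking for surrogate pairs might mean in practice: if a char index lands on the second half of a pair, step back one char to reach the start of the code point. The class and method names here are invented for illustration; Character.isLowSurrogate and Character.isHighSurrogate are standard JDK methods.

```java
final class SurrogateOffsets {
    // If charIndex points at the low (second) half of a surrogate pair,
    // return the index of the high (first) half; otherwise return charIndex.
    static int codePointStart(String s, int charIndex) {
        if (charIndex > 0 && charIndex < s.length()
                && Character.isLowSurrogate(s.charAt(charIndex))
                && Character.isHighSurrogate(s.charAt(charIndex - 1))) {
            return charIndex - 1;
        }
        return charIndex;
    }
}
```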

Aaron Digulla
+1  A: 

I would suggest that you do not have a byte offset into a standard Java String. If indeed you do, can you tell us how you got it? (Code, please.)

John Allen