So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

  • using String#charAt(int) to get the char at an index
  • testing whether the char is in the high-surrogates range
    • if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
    • if not, use the given char value as the codepoint, and increment the index by 1
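
The steps above can be sketched roughly as follows. This is only an illustration of the manual approach described in the list; the class name ManualCodePoints and the method name are made up for the example, and the extra low-surrogate check is an added assumption so that an unpaired high surrogate is treated as a single char.

```java
import java.util.ArrayList;
import java.util.List;

class ManualCodePoints {
    // Walks the string char by char, stepping by 2 over surrogate pairs.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pair) {
                // High surrogate followed by a low surrogate: one code point, two chars.
                result.add(s.codePointAt(i));
                i += 2;
            } else {
                // Ordinary BMP char (or an unpaired surrogate): one code point, one char.
                result.add((int) c);
                i += 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "\uD834\uDD1E" is the surrogate pair for U+1D11E (musical G clef).
        System.out.println(codePoints("a\uD834\uDD1Eb")); // prints [97, 119070, 98]
    }
}
```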

But my concerns are

  • I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.
+3  A: 

Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and yes, it encodes characters outside the BMP as surrogate pairs.

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
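
For reference, here is the same loop wrapped in a small runnable class, as a sketch. The class and method names (CodePointLoop, viaLoop, viaStream) are invented for the example. It also shows String#codePoints(), which returns an IntStream of code points; that method was added in Java 8, so it is only an option on Java 8 or later.

```java
import java.util.Arrays;

class CodePointLoop {
    // The charCount-based loop, collecting the code points into an array.
    static int[] viaLoop(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out[n++] = codepoint;
            // Advance by 1 for BMP code points, 2 for supplementary ones.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    // Java 8+ alternative: stream the code points directly.
    static int[] viaStream(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println(Arrays.toString(viaLoop(s)));
        System.out.println(Arrays.toString(viaStream(s)));
    }
}
```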
Jonathan Feinberg
As for whether or not it's "expensive", well... there is no other way built into Java. But if you're dealing only with Latin/European/Cyrillic/Greek/Hebrew/Arabic scripts, then you can just use s.charAt() to your heart's content. :)
Jonathan Feinberg
+3  A: 

Iterating over code points is filed as a feature request at Sun.

See Sun Bug Entry

There is also an example on how to iterate over String CodePoints there.

alexander.egger
+1  A: 
  • I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one

Code points outside the BMP are represented in a String as two char values (a surrogate pair).
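
A quick way to check how many char values a given code point occupies, as a sketch (the class name SurrogateDemo and the helper charLength are made up for the example):

```java
class SurrogateDemo {
    // Number of char values a String uses to store the given code point.
    static int charLength(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        System.out.println(charLength('A'));     // BMP character
        System.out.println(charLength(0x1D11E)); // supplementary code point (surrogate pair)
        System.out.println(charLength(0xD800));  // lone value from the high-surrogate range
        // The surrogate range holds no characters of its own.
        System.out.println(Character.isSurrogate((char) 0xD800));
    }
}
```

A value from the surrogate range on its own is stored as a single char, while a supplementary code point takes two.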

  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.

There is no better way (than @e.e's solution) that integrates nicely with the Java language / libraries as they are currently specified.

In theory, you could build a String32 == "string as a sequence of Unicode codepoints" class. In practice it would be more pain than it is worth. All of the standard Java APIs (and 3rd-party libraries) require String and assume 16-bit characters. To use your new class, you'd either need to replace many APIs with versions that use String32, or do lots of String <-> String32 conversions in your code.
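
To give an idea of what those conversions might look like, here is a minimal sketch that uses a plain int[] as a stand-in for the hypothetical String32. The class and method names are invented for the example; String#codePoints() assumes Java 8 or later, and the String(int[], int, int) constructor is part of the standard library.

```java
class String32Sketch {
    // String -> sequence of code points (one int per code point).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Sequence of code points -> String (re-encoded as UTF-16 chars).
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        int[] cps = toCodePoints(s);
        System.out.println(s.length());   // number of char values
        System.out.println(cps.length);   // number of code points
        System.out.println(fromCodePoints(cps).equals(s));
    }
}
```

Every call into an API that expects a String would need a round trip like this, which is the cost described above.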

Stephen C
I checked later, and found that [there are no valid codepoints in the U+D800–U+DFFF range](http://en.wikipedia.org/wiki/UTF-16#Encoding_of_characters_outside_the_BMP), so there's no ambiguity at all.
rampion