ansaurus

Question

How can I iterate through the unicode codepoints of a Java String?

Answer 1

+3 A:

Yes, Java uses UTF-16 for the internal representation of Strings, and, yes, it encodes characters outside the BMP (Basic Multilingual Plane) as surrogate pairs.

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the code points of a Java String:
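To make the surrogate encoding concrete, here is a minimal sketch that compares the number of char values in a String with the number of code points. The class and method names are made up for illustration; only the java.lang calls are real API.

```java
final class SurrogatePairDemo {
    // Number of UTF-16 code units (char values) in s.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600));

        System.out.println(charUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true
    }
}
```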

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}

Jonathan Feinberg 2009-10-06 20:21:35

As for whether or not it's "expensive", well... there is no other way built into Java. But if you're dealing only with Latin/European/Cyrillic/Greek/Hebrew/Arabic scripts, then you can just use s.charAt() to your heart's content. :)

Jonathan Feinberg 2009-10-06 20:25:32
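For readers on Java 8 or later (which postdates this thread), CharSequence.codePoints() returns an IntStream of code points and may be a convenient alternative to the manual offset loop. The following is only a sketch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointStreamDemo {
    // Formats each code point of s as "U+XXXX" (hypothetical helper).
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        // codePoints() combines each valid surrogate pair into one int.
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(describe(s)); // [U+0061, U+1F600]
    }
}
```

This requires Java 8 or newer; on older versions the codePointAt/charCount loop remains the way to go.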

Answer 2

+3 A:

Iterating over code points has been filed as a feature request with Sun.

See Sun Bug Entry

That entry also includes an example of how to iterate over the code points of a String.

alexander.egger 2009-10-06 20:22:01

Answer 3

+1 A:

I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one

Code points outside the BMP are represented in a String as two char values (a surrogate pair).

this seems like an awful expensive way to iterate through characters
someone must have come up with something better.

There is no better way (than @e.e's solution) that integrates nicely with the Java language / libraries as they are currently specified.

In theory, you could build a String32 == "string as a sequence of Unicode codepoints" class. In practice it would be more pain than it is worth. All of the standard Java APIs (and 3rd-party libraries) require String and assume 16-bit characters. To use your new class, you'd either need to replace many APIs with versions that use String32, or do lots of String <-> String32 conversions in your code.

Stephen C 2009-10-06 23:53:25
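As a rough illustration of the String <-> String32 conversions mentioned above, a plain int[] of code points can stand in for the hypothetical String32 class. This sketch assumes Java 8+ for codePoints(); the class and method names are invented for the example.

```java
final class CodePointArrayDemo {
    // String -> array of code points (one int per code point).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Array of code points -> String; throws IllegalArgumentException
    // if any element is not a valid Unicode code point.
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600));
        int[] cps = toCodePoints(s);

        System.out.println(cps.length);                   // 2
        System.out.println(fromCodePoints(cps).equals(s)); // true
    }
}
```

Each round trip copies the whole string, which is the conversion overhead the answer warns about.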

I checked later, and found that [there are no valid codepoints in the U+D800–U+DFFF range](http://en.wikipedia.org/wiki/UTF-16#Encoding_of_characters_outside_the_BMP), so there's no ambiguity at all.

rampion 2009-10-07 02:33:08
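Because the U+D800–U+DFFF range is reserved for surrogates, a pair of char values decodes unambiguously. A small sketch of the UTF-16 arithmetic, checked against Character.toCodePoint, might look like this (the class and method names are hypothetical):

```java
final class SurrogateMathDemo {
    // Decodes a high/low surrogate pair by hand using the UTF-16 formula.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = 0xD83D;
        char low = 0xDE00;

        System.out.println(Integer.toHexString(decode(high, low)));          // 1f600
        System.out.println(decode(high, low) == Character.toCodePoint(high, low)); // true
        System.out.println(Character.isSurrogate(high));                     // true
    }
}
```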
