Here's an excerpt from the java.text.CharacterIterator documentation:

This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.

static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is \uFFFF, the "not a character" value which should not occur in any valid Unicode string.

The final clause ("which should not occur in any valid Unicode string") is what I'm having trouble understanding. From my tests, a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except with the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").

Here's a snippet to illustrate the "problem" (see also on ideone.com):

    import java.text.*;

    public class CharacterIteratorTest {

        // this is the prescribed traversal idiom from the documentation
        public static void traverseForward(CharacterIterator iter) {
            for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
                System.out.print(c);
            }
        }

        public static void main(String[] args) {
            String s = "abc\uFFFFdef";

            System.out.println(s);
            // abc?def

            System.out.println(s.indexOf('\uFFFF'));
            // 3

            traverseForward(new StringCharacterIterator(s));
            // abc
        }
    }

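
For comparison, here is a minimal sketch of a traversal that tests the iterator's index instead of comparing against DONE. It uses only getIndex() and getEndIndex() from the CharacterIterator interface; the class name IndexBasedTraversal is made up for illustration, and I'm assuming StringCharacterIterator moves the index to the end index once next() runs off the text. With that assumption, it seems to get past the embedded \uFFFF:

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

// Hypothetical helper class, not part of the JDK.
public class IndexBasedTraversal {

    // Collects every char by comparing the current index with the end index,
    // so a '\uFFFF' inside the text is not mistaken for DONE.
    public static String traverseForward(CharacterIterator iter) {
        StringBuilder sb = new StringBuilder();
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String out = traverseForward(new StringCharacterIterator(s));

        System.out.println(out.length());
        // 7

        System.out.println(out.equals(s));
        // true
    }
}
```

This only sidesteps the false positive; it doesn't tell me whether \uFFFF is supposed to be allowed in the string in the first place.
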
So what is going on here?
- Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
- Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
- Is it actually true that valid Unicode strings should not contain \uFFFF?
- If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?