Java, not unreasonably, encodes Unicode characters into bytes in the native system encoding before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.
Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" codepages, and the default command prompt (cmd.exe) will use one of the old "OEM" DOS codepages (though it is possible to get ANSI and Unicode there with a bit of work).
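If you want to see what your own JVM picked up, a minimal sketch like the following prints the default charset. The class name ShowEncoding and the helper defaultCharsetName are made up for illustration; Charset.defaultCharset() and the file.encoding system property are standard. Note that the console may use a different encoding from the one reported here, and newer JDKs may report UTF-8 regardless of the Windows codepage.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
public class ShowEncoding {
    // Returns the canonical name of the JVM's default charset.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // May be null or differ from the console's codepage on some setups.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

In cmd.exe, running chcp shows the active console codepage, which you can compare against the value printed above.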
Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
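You can reproduce the substitution without a Windows console by encoding the string to windows-1252 yourself. This is only a sketch: the class name EncodeCheck and the helper hexBytes are invented for the example, and it assumes your JRE ships the windows-1252 charset (the usual desktop JDKs do). String.getBytes(Charset) replaces unmappable characters with the charset's replacement byte, which for windows-1252 is 0x3F, the question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class, for illustration only.
public class EncodeCheck {
    // Encodes text into the given charset and returns the bytes as space-separated hex.
    static String hexBytes(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(Integer.toHexString(b & 0xFF)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: the runtime provides the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        CharsetEncoder encoder = cp1252.newEncoder();

        System.out.println(encoder.canEncode('\u0308')); // false: no combining diaeresis in 1252
        System.out.println(encoder.canEncode('\u00EF')); // true: precomposed character exists

        System.out.println(hexBytes("\u0069\u0308", cp1252)); // 69 3f
        System.out.println(hexBytes("\u00EF", cp1252));       // ef
    }
}
```

The 3f in the first hex line is the question mark that ends up on the console.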
An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:
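import java.io.IOException;
import java.text.Normalizer;

public class NormalizeDemo {
    // Prints the string, then the hex value of each UTF-16 code unit.
    public static void emit(String foo) throws IOException {
        System.out.println("Literal: " + foo);
        System.out.print("Hex: ");
        for (char ch : foo.toCharArray()) {
            System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String foo = "\u0069\u0308"; // i followed by combining diaeresis
        emit(foo);
        // NFC composes the pair into the single code point U+00EF
        foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
        emit(foo);
    }
}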
Under windows-1252, this code will emit:
Literal: i?
Hex: 69 308
Literal: ï
Hex: ef