Consider the following code:
byte aBytes[] = { (byte)0xff,0x01,0,0,
(byte)0xd9,(byte)0x65,
(byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07,
(byte)0x17,(byte)0x33, (byte)0x74, (byte)0x6f,
0, 1, 2, 3, 4, 5,
0 };
String sCompressedBytes = new String(aBytes, "UTF-16");
for (int i=0; i<sCompressedBytes.length; i++) {
System.out.println(Integer.toHexString(sCompressedBytes.codePointAt(i)));
}
Gets the following incorrect output:
ff01, 0, fffd, 506, 717, 3374, 6f00, 102, 304, 500.
However, if the 0xd9
in the input data is changed to 0x9d
, then the following correct output is obtained:
ff01, 0, 9d65, 304, 506, 717, 3374, 6f00, 102, 304, 500.
I realize that the functionality is because of the fact that the byte 0xd9
is a high-surrogate Unicode marker.
Question: Is there a way to feed, identify and extract surrogate bytes (0xd800
to 0xdfff
) in a Java Unicode string?
Thanks