Consider the following code:

byte aBytes[] = { (byte)0xff,0x01,0,0,
                  (byte)0xd9,(byte)0x65,
                  (byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07,
                  (byte)0x17,(byte)0x33, (byte)0x74, (byte)0x6f,
                   0, 1, 2, 3, 4, 5,
                   0 };
String sCompressedBytes = new String(aBytes, "UTF-16");
for (int i=0; i<sCompressedBytes.length(); i++) {
    System.out.println(Integer.toHexString(sCompressedBytes.codePointAt(i)));
}

It produces the following incorrect output:

ff01, 0, fffd, 506, 717, 3374, 6f00, 102, 304, 500.

However, if the 0xd9 in the input data is changed to 0x9d, then the following correct output is obtained:

ff01, 0, 9d65, 304, 506, 717, 3374, 6f00, 102, 304, 500.

I realize this happens because the bytes 0xd9 0x65 form a UTF-16 high surrogate (0xd965), which is only valid when followed by a low surrogate.

Question: Is there a way to feed, identify and extract surrogate bytes (0xd800 to 0xdfff) in a Java Unicode string?
Thanks

+8  A: 

EDIT: This addresses the question from the comment

If you want to encode arbitrary binary data in a string, you should not use a normal text encoding. You don't have valid text in that encoding - you just have arbitrary binary data.

Base64 is the way to go here. Before Java 8 there was no Base64 support directly in Java (in a public class, anyway); Java 8 added java.util.Base64. On older versions there are various third-party libraries you can use, such as the one in the Apache Commons Codec library.

Yes, base64 will increase the size of the data - but it'll allow you to decode it later without losing information.
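
As a minimal sketch of that round trip, assuming Java 8 or later (where java.util.Base64 is part of the standard library): the class and method names below are made up for illustration, and the sample bytes are just the first eight bytes of the question's array.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    // Encode arbitrary bytes as an ASCII-only string (hypothetical helper name).
    static String toText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover the original bytes from the Base64 text (hypothetical helper name).
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // First eight bytes of the question's array, including the troublesome 0xd9
        byte[] raw = { (byte) 0xff, 0x01, 0, 0, (byte) 0xd9, 0x65, 0x03, 0x04 };
        String text = toText(raw);
        System.out.println(text);
        // Prints true: decoding gives back exactly the bytes that went in
        System.out.println(Arrays.equals(raw, fromText(text)));
    }
}
```

The encoded string is roughly a third longer than the input, but it contains only ASCII characters, so it survives any later text handling.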

EDIT: This addresses the original question

I believe that the problem is that you haven't specified a proper surrogate pair. You should specify bytes representing a high surrogate and then a low surrogate. After that, you should be able to extract the appropriate code point. In your case, you've given a high surrogate on its own.

Here's code to demonstrate this:

public class Test
{
    public static void main(String[] args)
        throws Exception // Just for simplicity
    {
        byte[] data = 
        {
            0, 0x41, // A
            (byte) 0xD8, 1, // High surrogate
            (byte) 0xDC, 2, // Low surrogate
            0, 0x42, // B
        };

        String text = new String(data, "UTF-16");

        System.out.printf("%x\r\n", text.codePointAt(0));
        System.out.printf("%x\r\n", text.codePointAt(1));
        // Code point at 2 is part of the surrogate pair
        System.out.printf("%x\r\n", text.codePointAt(3));       
    }
}

Output:

41
10402
42
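
When looping over a whole string rather than picking indices by hand, stepping by Character.charCount avoids landing on the second half of a pair. A rough sketch using the same bytes; the class and method names are illustrative only, and it uses StandardCharsets.UTF_16BE (Java 7+) to make the byte order explicit:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Collect each code point once, skipping the low half of surrogate pairs.
    static List<Integer> codePoints(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = {
            0, 0x41,         // A
            (byte) 0xD8, 1,  // High surrogate
            (byte) 0xDC, 2,  // Low surrogate
            0, 0x42,         // B
        };
        String text = new String(data, StandardCharsets.UTF_16BE);
        // Prints 41, 10402 and 42, one per line
        for (int cp : codePoints(text)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```
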
Jon Skeet
I believe you're right. I had just come to the same conclusion but checked back to see if anyone more knowledgeable had already answered.
Michael Myers
Simply inserting "(byte) 0xdc, (byte) 0xef," yields "ff010694efdcef304...", which is as it should be.
Michael Myers
Thanks for your answers. But the problem is not about embedding surrogate characters. The requirement is to feed an arbitrary byte sequence (the output of compression) into a Java string and to read it back as an identical byte sequence.
VSK
If you'd made that a separate answer, I'd have upvoted it. Now I'm stuck with a single upvote for two good answers. (But it's not like upvotes matter too much to you at this point in the day, right?)
Michael Myers
@mmyers: Indeed. It didn't feel like it was worth giving an extra answer...
Jon Skeet
+1  A: 

Is there a way to feed, identify and extract surrogate bytes (0xd800 to 0xdfff) in a Java Unicode string?

Just because no one has mentioned it, I'll point out that the Character class includes methods for working with surrogate pairs, e.g. isHighSurrogate(char), codePointAt(CharSequence, int) and toChars(int). I realise that this is beside the point of the stated problem.
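
For what it's worth, here is a small sketch of how those Character methods might be used to identify surrogates, including an unpaired one like the 0xd965 in the question. The class and helper names are invented for the example:

```java
public class SurrogateInspect {
    // Describe a single UTF-16 code unit.
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) return "high";
        if (Character.isLowSurrogate(c)) return "low";
        return "bmp";
    }

    // True if the string contains a surrogate that is not part of a high+low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low half
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x10402); // split a supplementary code point
        System.out.println(classify(pair[0]) + " " + classify(pair[1])); // high low
        System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1]))); // 10402
        // A String is just a sequence of chars, so it can hold a lone surrogate
        String lone = new String(new char[] { (char) 0xD965, (char) 0x0304 });
        System.out.println(hasUnpairedSurrogate(lone)); // true
    }
}
```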

new String(aBytes, "UTF-16");

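This is a decoding operation that will transform the input data. Java's "UTF-16" charset checks the start of the input for a byte order mark (0xfe 0xff or 0xff 0xfe) and falls back to big-endian when none is present, so your first two bytes (0xff 0x01) are decoded as an ordinary character. In addition, not every byte sequence can be decoded correctly: UTF-16 is a variable-width encoding, and a surrogate that is not part of a valid high/low pair is malformed and gets replaced with U+FFFD, which is the fffd in your output.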
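
If you would rather detect malformed input than have it silently replaced, one option is a CharsetDecoder configured to report errors. This is only a sketch: the class and method names are made up, and it assumes big-endian input with no byte order mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf16 {
    // Returns true only if the bytes are well-formed big-endian UTF-16.
    static boolean isWellFormed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting U+FFFD
            return false;
        }
    }

    public static void main(String[] args) {
        // High surrogate 0xd965 followed by U+0304, as in the question
        byte[] lone = { (byte) 0xd9, 0x65, 0x03, 0x04 };
        // The same high surrogate followed by a low surrogate
        byte[] pair = { (byte) 0xd9, 0x65, (byte) 0xdc, (byte) 0xef };
        System.out.println(isWellFormed(lone)); // false
        System.out.println(isWellFormed(pair)); // true
    }
}
```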

If you wanted a symmetric transformation of arbitrary bytes to String and back, you are better off with an 8-bit, single-byte encoding in which every byte value maps to a character, such as ISO-8859-15:

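// Requires: import java.nio.charset.Charset; import java.util.Arrays;
Charset iso8859_15 = Charset.forName("ISO-8859-15");
// Fill the array with every possible byte value, -128 to 127
byte[] data = new byte[256];
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
  data[i - Byte.MIN_VALUE] = (byte) i;
}
// Decode to a String, then encode back and compare with the original
String asString = new String(data, iso8859_15);
byte[] encoded = asString.getBytes(iso8859_15);
System.out.println(Arrays.equals(data, encoded));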

Note: the number of characters is going to equal the number of bytes (doubling the size of the data, since each Java char is 16 bits wide); the resultant string isn't necessarily going to be printable (containing, as it might, a bunch of control characters).

I'm with Jon, though - putting arbitrary byte sequences into Java strings is almost always a bad idea.

McDowell