I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: UTF-16LE support is required of every JRE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
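To make the decoding side concrete, here is a minimal sketch (class and method names are illustrative) that relies on the "UTF-16" charset consuming a leading BOM and choosing the byte order from it:

```java
import java.io.UnsupportedEncodingException;

public class Utf16Decode {
    // Decodes a UTF-16 byte array that starts with a BOM (either endianness).
    public static String decodeString(byte[] bytes) {
        try {
            // "UTF-16" auto-detects a leading BOM (FE FF or FF FE) and strips it
            return new String(bytes, 0, bytes.length, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // should not be possible: UTF-16 support is required of every JRE
            throw new AssertionError("Could not decode UTF-16", e);
        }
    }

    public static void main(String[] args) {
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // "hi", UTF-16LE with BOM
        System.out.println(decodeString(le)); // prints: hi
    }
}
```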
+7  A: 

The "UTF-16" charset name will always encode with a BOM and will decode data using either big/little endianness, but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings are absolutely required to be supported.

McDowell
Thanks! One more issue though... Using "UTF-16" encodes the data as Big Endian, which I suspect will not go over well with Microsoft data (even though the BOM exists). Any way to encode UTF-16LE with BOM with Java? I'll update my question to reflect what I was really looking for...
Jared Oberhaus
Click on the "see this post" link he gave. Basically, you stuff a \uFEFF character at the beginning of your string, and then encode to UTF-16LE, and the result will have a proper BOM.
Daniel Martin
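The trick described in this comment can be sketched as follows (StandardCharsets is Java 7+, which postdates this thread; the string name "UTF-16LE" behaves the same way):

```java
import java.nio.charset.StandardCharsets;

public class BomPrepend {
    // U+FEFF encoded as UTF-16LE is the byte pair FF FE, i.e. exactly the
    // little-endian BOM, so prepending the character yields a BOM-prefixed array.
    public static byte[] encodeWithBom(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] out = encodeWithBom("hi");
        System.out.printf("%02X %02X%n", out[0] & 0xFF, out[1] & 0xFF); // prints: FF FE
    }
}
```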
Use "UnicodeLittle" (assuming your JRE supports it - ("\uFEFF" + "my string").getBytes("UTF-16LE") otherwise). Though I would be surprised if Microsoft APIs expected a BOM but couldn't handle big-endian data - they tend to like using BOMs more than other platforms. Test with empty strings - you may get empty arrays if there is no data.
McDowell
I would be completely unsurprised at Microsoft defining a format where it expects a UTF-16LE BOM to begin a file and will not behave properly if the file begins with a UTF-8 BOM or a UTF-16BE BOM. I would be completely unsurprised because this is exactly the behavior I have observed with Excel loading CSV files - if the file begins with a UTF-16LE BOM, then it loads the data as UTF-16LE and expects tabs between columns. With any other leading byte sequence it loads the data in some local character set with "," or ";" (locale-dependent!) between columns.
Daniel Martin
Thanks for the Excel anecdote, @Daniel Martin. Exactly the kind of behavior I don't want to discover. :)
Jared Oberhaus
Just to reiterate: "UnicodeLittle" (a.k.a. "x-UTF-16LE-BOM") will write the file as UTF-16 little-endian with a BOM. This should be the preferred method for WRITING the files, but it only seems to be available since Java 6 (JDK 1.6). For READING, you should stick with "UTF-16".
Alan Moore
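Since availability of these aliases varies by JRE (as the comment notes, "x-UTF-16LE-BOM" appeared around Java 6), a guarded probe is a safe way to check before relying on them; only "UTF-16", "UTF-16LE", and "UTF-16BE" are guaranteed everywhere:

```java
import java.nio.charset.Charset;

public class BomCharsetProbe {
    public static void main(String[] args) {
        // "x-UTF-16LE-BOM" (alias "UnicodeLittle") writes an LE BOM on encode,
        // but its presence depends on the JRE's installed charsets.
        String[] names = {"x-UTF-16LE-BOM", "UnicodeLittle", "UTF-16", "UTF-16LE"};
        for (String name : names) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
    }
}
```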
+2  A: 
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).

I wrote the above before I saw your comment, I think the answer to use the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know off hand how you get that done.

Yishai
Thanks. In addition what I would have liked here is to not allocate the entire byte array with string.getBytes("UTF-16LE")--perhaps by wrapping the stream as an InputStream, which was the point of my earlier question: http://stackoverflow.com/questions/837703/how-can-i-get-a-java-io-inputstream-from-a-java-lang-string
Jared Oberhaus
Note that this code actually allocates arrays big enough for the String three times, since you have the internal array of the ByteArrayOutputStream, which is then copied in the call to .toByteArray(). A way to get it back down to only two allocations is to wrap the ByteArrayOutputStream in an OutputStreamWriter and write the string to that. Then you still have the ByteArrayOutputStream's internal buffer and the copy made by .toByteArray(), but not the return value from .getBytes.
Daniel Martin
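The variant described in this comment can be sketched like this (a UTF-16LE OutputStreamWriter encodes straight into the stream, so no intermediate array from getBytes; StandardCharsets is Java 7+, used here for brevity):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class WriterEncode {
    public static byte[] encodeWithBom(String message) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(message.length() * 2 + 2);
        out.write(0xFF);                  // little-endian BOM
        out.write(0xFE);
        OutputStreamWriter writer =
            new OutputStreamWriter(out, StandardCharsets.UTF_16LE);
        writer.write(message);
        writer.flush();                   // push the encoder's buffered bytes into the stream
        return out.toByteArray();         // one copy here, plus the stream's internal buffer
    }

    public static void main(String[] args) throws IOException {
        for (byte b : encodeWithBom("hi")) {
            System.out.printf("%02X ", b & 0xFF); // prints: FF FE 68 00 69 00
        }
        System.out.println();
    }
}
```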
It seems that you are just exchanging a char array for a byte array if you do that, as the OutputStreamWriter delegates to the StreamEncoder class, which creates a char[] buffer to retrieve the String data. String is immutable, and the size of an array is invariable, so that copy seems unavoidable. I think nio is supposed to help with that double creation on the ByteArrayOutputStream
Yishai
+3  A: 

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big endian stuff.

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work not with a byte array but with a java.nio ByteBuffer, using the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).

Daniel Martin
Thanks, good advice.
Jared Oberhaus
+1  A: 

This is how you do it in nio:

    ByteBuffer buffer = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    byte[] result = new byte[buffer.remaining()]; // copy out exactly the encoded bytes
    buffer.get(result);                           // (array() may have unused trailing capacity)
    return result;

It is supposed to be faster, and while I don't know how many arrays it creates under the covers, my understanding is that the point of the API is to minimize that.

Yishai