ansaurus

Question

How do I encode/decode UTF-16LE byte arrays with a BOM?

Answer 1

+7 A:

The "UTF-16" charset name always encodes with a BOM and decodes data in either byte order, while "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical charset names or (preferably) the Charset class. Also note that only a limited subset of encodings is absolutely required to be supported.
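
As a rough illustration of the charset names mentioned above, this sketch prints the bytes each one produces for the same string. The class name Utf16CharsetNames and the hex helper are made up for the example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16CharsetNames {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a";
        // "UTF-16": big-endian data preceded by a big-endian BOM (FE FF).
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));
        // "UTF-16BE" and "UTF-16LE": fixed byte order, no BOM.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE)));
        // "UnicodeLittle" is not guaranteed to exist, so check before using it.
        if (Charset.isSupported("UnicodeLittle")) {
            System.out.println(hex(text.getBytes(Charset.forName("UnicodeLittle"))));
        }
    }
}
```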

McDowell 2009-05-18 20:08:45

Thanks! One more issue though... Using "UTF-16" encodes the data as Big Endian, which I suspect will not go over well with Microsoft data (even though the BOM exists). Any way to encode UTF-16LE with BOM with Java? I'll update my question to reflect what I was really looking for...

Jared Oberhaus 2009-05-18 20:14:13

Click on the "see this post" link he gave. Basically, you stuff a \uFEFF character at the beginning of your string, and then encode to UTF-16LE, and the result will have a proper BOM.
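
A minimal sketch of that trick, assuming Java 7 or later for StandardCharsets. The class and method names (BomPrefix, encodeWithBom, decode) are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class BomPrefix {
    // Prepend U+FEFF so that the UTF-16LE encoder emits FF FE as the first two bytes.
    static byte[] encodeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    // "UTF-16" reads the BOM, picks the byte order from it, and drops it from the result.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeWithBom("my string");
        System.out.println(encoded.length + " bytes, round trip: " + decode(encoded));
    }
}
```

Note that the concatenation creates a temporary String, so this is the simple option rather than the allocation-free one.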

Daniel Martin 2009-05-18 20:17:55

Use "UnicodeLittle", assuming your JRE supports it; otherwise use ("\uFEFF" + "my string").getBytes("UTF-16LE"). That said, I would be surprised if Microsoft APIs expected a BOM but couldn't handle big-endian data - they tend to like using BOMs more than other platforms. Test with empty strings - you may get empty arrays if there is no data.

McDowell 2009-05-18 20:22:48

I would be completely unsurprised at Microsoft defining a format where it expects a UTF-16LE BOM to begin a file and will not behave if the file begins with a UTF-8 BOM or a UTF-16BE BOM. I would be completely unsurprised because this is exactly the behavior I have observed with Excel loading CSV files: if the file begins with a UTF-16LE BOM, then it loads the data as UTF-16LE and expects tabs between columns. Given any other character sequence, it loads the data in some local character set with "," or ";" (locale-dependent!) between columns.

Daniel Martin 2009-05-18 20:42:37

Thanks for the Excel anecdote, @Daniel Martin. Exactly the kind of behavior I don't want to discover. :)

Jared Oberhaus 2009-05-18 23:07:30

Just to reiterate: "UnicodeLittle" (a.k.a. "x-UTF-16LE-BOM") will write the file as UTF-16 little-endian with a BOM. This should be the preferred method for WRITING the files, but it only seems to be available since Java 6 (JDK 1.6). For READING, you should stick with "UTF-16".
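
One way this advice might look in code, as a sketch only: because "x-UTF-16LE-BOM" is not among the charsets every JRE must provide, the example checks for it and otherwise falls back to prepending the BOM by hand. The class and method names (BomCharsetFallback, write, read) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCharsetFallback {
    static byte[] write(String text) {
        // "x-UTF-16LE-BOM" is an extended charset, so it may be missing on some JREs.
        if (Charset.isSupported("x-UTF-16LE-BOM")) {
            return text.getBytes(Charset.forName("x-UTF-16LE-BOM"));
        }
        // Fallback: add the BOM by hand and encode as plain UTF-16LE.
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    static String read(byte[] bytes) {
        // "UTF-16" detects the byte order from the BOM while decoding.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] data = write("my string");
        System.out.println(read(data));
    }
}
```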

Alan Moore 2009-05-18 23:51:30

Answer 2

+2 A:

    // Reserve two bytes per char plus two for the BOM.
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    // Write the UTF-16LE BOM (FF FE), then the little-endian payload.
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).

I wrote the above before I saw your comment. I think the answer to use the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get that done.

Yishai 2009-05-18 20:09:49

Thanks. In addition what I would have liked here is to not allocate the entire byte array with string.getBytes("UTF-16LE")--perhaps by wrapping the stream as an InputStream, which was the point of my earlier question: http://stackoverflow.com/questions/837703/how-can-i-get-a-java-io-inputstream-from-a-java-lang-string

Jared Oberhaus 2009-05-18 20:21:23

Note that this code actually allocates arrays big enough for the String three times, since the internal array of the ByteArrayOutputStream is copied in the call to .toByteArray(). A way to get it back down to only two allocations is to wrap the ByteArrayOutputStream in an OutputStreamWriter and write the string to that. Then you still have the ByteArrayOutputStream's internal state and the copy made by .toByteArray(), but not the return value from .getBytes().
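
The OutputStreamWriter variant described here could be sketched roughly as follows. The class and method names are made up, and the IOException is wrapped only to keep the example short (UncheckedIOException assumes Java 8 or later).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class BomWriter {
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(text.length() * 2 + 2);
        // Closing the writer flushes its internal buffer into the stream.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16LE)) {
            writer.write('\uFEFF'); // encoded as the bytes FF FE
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream, but the Writer API declares it.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeWithBom("my string").length + " bytes");
    }
}
```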

Daniel Martin 2009-05-18 20:55:29

It seems that you are just exchanging a char array for a byte array if you do that, as the OutputStreamWriter delegates to the StreamEncoder class, which creates a char[] buffer to retrieve the String data. String is immutable, and the size of an array is fixed, so that copy seems unavoidable. I think nio is supposed to help with the double allocation in the ByteArrayOutputStream.

Yishai 2009-05-18 21:29:41

Answer 3

+3 A:

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big endian stuff.
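
To illustrate the decoding side, here is a small sketch with hand-written byte arrays for "hi" in both byte orders. The class name BomDetection and the helper decodeUtf16 are assumptions for the example; StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class BomDetection {
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00, 0x69, 0x00};
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x68, 0x00, 0x69};
        // Both decode to the same text; the BOM selects the byte order and is consumed.
        System.out.println(decodeUtf16(littleEndian).equals(decodeUtf16(bigEndian)));
        // Decoding with UTF-16LE directly keeps the BOM as a leading U+FEFF character.
        String withBom = new String(littleEndian, StandardCharsets.UTF_16LE);
        System.out.println(withBom.length());
    }
}
```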

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to deal not with a byte array but with a java.nio ByteBuffer, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).
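
A sketch of what the CharsetEncoder route might look like, under the assumption that a ByteBuffer is an acceptable result type. It writes the BOM bytes first and then encodes straight into the same buffer, so no intermediate byte[] from getBytes is created. The names BomByteBuffer and encodeWithBom are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class BomByteBuffer {
    static ByteBuffer encodeWithBom(CharSequence text) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        // UTF-16LE needs exactly two bytes per char, plus two for the BOM.
        ByteBuffer out = ByteBuffer.allocate(text.length() * 2 + 2);
        out.put((byte) 0xFF).put((byte) 0xFE);
        // CharBuffer.wrap gives a read-only view of the text without copying it.
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        try {
            if (result.isError()) {
                result.throwException(); // e.g. an unpaired surrogate
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Text is not valid UTF-16", e);
        }
        encoder.flush(out);
        out.flip(); // switch the buffer from writing to reading
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = encodeWithBom("my string");
        System.out.println(buffer.remaining() + " bytes ready to read");
    }
}
```

The returned buffer can be handed to a channel directly; copying it into a byte[] would reintroduce one of the allocations this approach is meant to avoid.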

Daniel Martin 2009-05-18 20:15:47

Thanks, good advice.

Jared Oberhaus 2009-05-18 20:17:24

Answer 4

+1 A:

This is how you do it in nio:

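    // Prepend U+FEFF so the encoder emits the BOM itself; an absolute put(0, ...)
    // would overwrite the first encoded character instead of inserting a BOM.
    ByteBuffer buffer = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    // Copy only the encoded bytes; the backing array may be larger than the content.
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return bytes;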

It is supposed to be faster. I don't know how many arrays it makes under the covers, but my understanding is that the point of the API is to minimize that.

Yishai 2009-05-18 23:09:56
