ansaurus

Question

How can I decode UTF-16 data in Perl when I don't know the byte order?

Answer 1

+2 A:

You need to specify either UTF-16BE or UTF-16LE. See http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM

Snake Plissken 2010-05-22 12:19:21

Answer 2

A:

What you're trying to do is impossible.

You're reading lines of text without specifying an encoding, so every \x0a byte (the default newline) ends a line. But that byte may well fall in the middle of a UTF-16 character, in which case the next line can't be decoded. If your data is UTF-16LE, this happens all the time: line feeds are \x0a \x00, so every line after the first starts with a stray \x00. If you have UTF-16BE, you might get lucky (newlines are \x00 \x0a) until you hit a character with \x0a in the high byte.

So don't do that; open the file with the right encoding.
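
As a rough illustration of opening the file with an encoding layer, here is a minimal sketch. It assumes the data is UTF-16LE without a BOM and writes its own temporary sample file; the sample text and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write a small UTF-16LE sample file (no BOM) so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} encode('UTF-16LE', "first\nsecond\n");
close $out or die "Cannot close $path: $!";

# The :encoding layer decodes the bytes before readline looks for newlines,
# so lines are split on decoded characters rather than on raw \x0a bytes.
open my $in, '<:encoding(UTF-16LE)', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in;

printf "%d lines read\n", scalar @lines;
```

If the data is big-endian, the same sketch should work with UTF-16BE in the layer name.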

mscha 2010-05-22 14:10:04

What if you don't always have a file, and you only get passed a string?

Geo 2010-05-22 15:06:25

It's not impossible: see my answer for how you should handle incomplete byte sequences.

brian d foy 2010-05-22 15:31:39

Answer 3

+2 A:

If you simply specify "UTF-16", Perl is going to look for the byte-order mark (BOM) to figure out how to parse it. If there is no BOM, it's going to blow up (Encode croaks with an "Unrecognised BOM" error; newer Encode releases fall back to big-endian instead). In that case, you have to tell Encode which byte order you have by specifying either "UTF-16LE" for little-endian or "UTF-16BE" for big-endian.
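
When the byte order really is unknown, one option is to check for the BOM yourself and fall back to a guess when it is missing. The sketch below is only an illustration: utf16_encoding_for and decode_utf16 are hypothetical helper names, and the fallback order is something the caller has to choose.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: pick an explicit encoding name from the first two
# bytes, or return the caller's fallback when no BOM is present.
sub utf16_encoding_for {
    my ($bytes, $fallback) = @_;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return $fallback;
}

# Hypothetical helper: strip the BOM (if any), then decode with the
# byte order chosen above.
sub decode_utf16 {
    my ($bytes, $fallback) = @_;
    my $encoding = utf16_encoding_for($bytes, $fallback);
    $bytes =~ s/\A(?:\xFE\xFF|\xFF\xFE)//;
    return decode($encoding, $bytes);
}

my $with_bom    = "\xFF\xFE" . encode('UTF-16LE', 'hi');
my $without_bom = encode('UTF-16BE', 'hi');

my $first  = decode_utf16($with_bom,    'UTF-16BE');   # BOM says little-endian
my $second = decode_utf16($without_bom, 'UTF-16BE');   # no BOM, uses the fallback
```

Without a BOM the fallback is still a guess, so data in the other byte order will decode into the wrong characters rather than raise an error.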

There's something else going on in your situation, though, and it's hard to tell what without seeing the data you have in the file. I get the same error with both snippets. If I don't have a BOM and I don't specify a byte order, my Perl complains either way. Which Perl are you using and which platform do you have? Does your platform have the native endianness of your file? I think the behaviour I see is correct according to the docs.

Also, you can't simply read a line in some unknown encoding (whatever Perl's default is) then ship that off to decode. You might end up in the middle of a multi-byte sequence. You have to use Encode::FB_QUIET to save the part of the buffer that you couldn't decode and add that to the next chunk of data:

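use Encode qw(decode);

open my $lefh, '<:raw', 'text-utf16.txt' or die "Cannot open file: $!";

my $string = '';
while ( defined( my $line = <$lefh> ) ) {
    $string .= $line;
    # FB_QUIET leaves any bytes it could not decode in $string for the next pass
    print decode("UTF-16LE", $string, Encode::FB_QUIET);
}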
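
To see the carry-over behaviour without a file, here is a small self-contained sketch. It assumes UTF-16LE input containing only BMP characters and cuts the bytes into 5-byte chunks so that chunk boundaries land inside 2-byte code units. The chunk size and sample text are arbitrary, and surrogate pairs split across chunks are not covered.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample data: 16 bytes of UTF-16LE, split into chunks of 5, 5, 5 and 1 bytes.
my $bytes  = encode('UTF-16LE', "one\ntwo\n");
my @chunks = unpack '(a5)*', $bytes;

my $buffer = '';   # undecoded bytes carried over between chunks
my $text   = '';   # decoded characters collected so far

for my $chunk (@chunks) {
    $buffer .= $chunk;
    # With FB_QUIET, decode consumes what it can and leaves the
    # incomplete trailing bytes in $buffer for the next iteration.
    $text .= decode('UTF-16LE', $buffer, Encode::FB_QUIET());
}
```

Anything still in $buffer after the loop would be an incomplete trailing code unit, so checking that it is empty is a cheap sanity test.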

brian d foy 2010-05-22 15:30:24

You know, if I concatenate the strings into one large buffer, I can use decode on it successfully.

Geo 2010-05-22 16:34:47

You can decode the whole thing at once because it sees the BOM for the whole string. Breaking it up into individual lines means the BOM is only for the first chunk. Encode doesn't do anything special to try to guess that one string is somehow related to another.

brian d foy 2010-05-23 07:28:47
