
I have a Java program that runs msinfo32.exe (system information) in an external process and then reads the file content produced by msinfo32.exe. When the Java program loads the file content into a String, the String characters are unreadable. For the String to be readable I have to create the String using String(byte[] bytes, String charsetName) and set charsetName to UTF-16. However, when running on one instance of Windows 2003, only UTF-16LE (little endian) results in a printable string.
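In code, the decoding step looks roughly like this (the byte array holds the raw file content; reading the file is omitted):

import java.io.UnsupportedEncodingException;

public class MsinfoDecode {
    static String decode(byte[] bytes) throws UnsupportedEncodingException {
        // new String(bytes) with the platform default charset gives unreadable text;
        // "UTF-16" works on most machines, but one Windows 2003 instance needs "UTF-16LE"
        return new String(bytes, "UTF-16");
    }
}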

How can I know ahead of time which character encoding to use?

Also, any background information on this topic would be appreciated.

thanks!! Mike

+1  A: 

You can’t really know what character encoding has been used (unless you created the tool that created the output you’re processing). You can try a list of pre-defined encodings and choose the one that does not result in any decoding errors, but depending on the input, more than one encoding might match.
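A minimal sketch of that trial-and-error approach in Java, using a strict decoder so that invalid byte sequences raise an error instead of being silently replaced (the candidate list here is only an example):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingGuesser {
    // Candidate encodings to try, in order of preference (example list)
    private static final String[] CANDIDATES = { "UTF-16", "UTF-16LE", "UTF-8" };

    public static String decodeWithFirstValidCharset(byte[] bytes) {
        for (String name : CANDIDATES) {
            try {
                // REPORT makes the decoder throw instead of inserting replacement characters
                return Charset.forName(name)
                        .newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
            } catch (CharacterCodingException e) {
                // This candidate failed; try the next one
            }
        }
        throw new IllegalArgumentException("None of the candidate encodings matched");
    }
}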

Bombe
A: 

If you don't know the character encoding beforehand and it differs across platforms, then you need to somehow analyze the byte array to try to guess it. There are some detection algorithms available, but they may be overkill for your application.

Can you tweak your application to produce a known output? It doesn't need to be a full line; the first few characters will do. If so, you could compare the produced byte array with the expected bytes in various encodings and do the detection that way. The byte arrays for UTF-8, UTF-16 big endian and UTF-16 little endian will be different even for simple strings.
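As an illustration, here is a small sketch that prints a known prefix in each candidate encoding so you can see how the byte patterns differ; the prefix "<?xml" is only an example of what the output might start with:

import java.io.UnsupportedEncodingException;

public class ShowEncodedBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String expectedPrefix = "<?xml";   // hypothetical known start of the output
        for (String cs : new String[] { "UTF-8", "UTF-16BE", "UTF-16LE" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : expectedPrefix.getBytes(cs)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(cs + ": " + hex);
        }
        // Prints:
        // UTF-8:    3C 3F 78 6D 6C
        // UTF-16BE: 00 3C 00 3F 00 78 00 6D 00 6C
        // UTF-16LE: 3C 00 3F 00 78 00 6D 00 6C 00
    }
}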

kgiannakakis
+1  A: 

You could try to use a library to guess the encoding; for instance, I have once used this solution.
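For example, ICU4J ships a charset detector; this is just one such library and may not be the one linked above, so treat it as a sketch:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectWithIcu {
    public static String detectCharsetName(byte[] bytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);                 // feed the raw bytes to the detector
        CharsetMatch match = detector.detect();  // best guess, with a confidence score
        return match == null ? null : match.getName();
    }
}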

Fabian Steeg
+3  A: 

Some Microsoft applications use a byte-order mark to indicate Unicode files and their endianness. I can see on my Windows XP machine that the exported .NFO file starts with 0xFFFE, so it is little-endian.

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00         __<_?_x_m_l_ _v_
65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00         e_r_s_i_o_n_=_"_
31 00 2E 00 30 00 22 00 3F 00 3E 00 0D 00 0A 00         1_._0_"_?_>_____
3C 00 4D 00 73 00 49 00 6E 00 66 00 6F 00 3E 00         <_M_s_I_n_f_o_>_
0D 00 0A 00 3C 00 4D 00 65 00 74 00 61 00 64 00         ____<_M_e_t_a_d_

Also, I recommend you switch to using Reader implementations rather than the String constructor for decoding files; this helps avoid problems where you read half a character because it has been truncated at the end of a byte array.
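A sketch of the Reader-based approach, with the decoding done by an InputStreamReader instead of the String constructor; the file path and charset name are parameters you would supply:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithReader {
    public static String readFile(String path, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        // InputStreamReader decodes as it reads, so a multi-byte character is never
        // split across two separate byte arrays
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charsetName));
        try {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }
}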

McDowell
A: 

The way it's supposed to work is, if someone gives you a file and says it's UTF-16, they expect you to examine the first two bytes (the BOM) to find out whether it's big-endian or little-endian. But if they tell you the encoding is UTF-16LE, it means there's no BOM; you don't need it because they've already told you the byte order is little-endian. Java follows these rules precisely, which is a real pisser because nobody else does.

The native character encoding of modern Windows operating systems is UTF-16, little-endian. Unfortunately, individual programs don't seem to be consistent when it comes to byte-order marks. And you can't just use UTF-16LE all the time because, if the BOM is there, it will be passed through as a junk character. The only way to know ahead of time whether to use UTF-16 or UTF-16LE is to examine the first two bytes, as McDowell described.
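A minimal sketch of that check; the fallback to little-endian when there is no BOM is an assumption based on the above:

public class Utf16BomSniffer {
    // Picks a charset name based on the first two bytes of the data.
    // If a BOM is present, "UTF-16" lets Java consume it and pick the byte order;
    // otherwise fall back to little-endian, on the assumption that the data came
    // from a Windows tool.
    public static String chooseCharset(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if ((b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)) {
                return "UTF-16";   // BOM present; the decoder will honour it
            }
        }
        return "UTF-16LE";         // no BOM; assume little-endian
    }
}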

Alan Moore