views:

596

answers:

2

EDIT: I've been convinced that this question is somewhat non-sensical. Thanks to those who responded. I may post a follow-up question that is more specific.

Today I was investigating some encoding problems and wrote this unit test to isolate a minimal repro case:

int badCount = 0;
for (int i = 1; i < 255; i++) {
    String str = "Hi " + new String(new char[] { (char) i });

    String toLatin1  = new String(str.getBytes("UTF-8"), "latin1");
    assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

    String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
    String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

    if (!str.equals(fromWin1252)) {
        System.out.println("Can't encode: " + i + " - " + str + 
                           " - encodes as: " + fromWin1252);
        badCount++;
    }
}

System.out.println("Bad count: " + badCount);

The output:

    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically round-trips all 254 characters; Windows-1252 does not. The three printable characters (193, 205, and 207) occupy the same code positions in Latin1 and Windows-1252, so I wouldn't expect any issues with them.
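One way to confirm that those three characters are encodable at all, independent of the UTF-8 round trip, is a quick sketch with CharsetEncoder.canEncode (my addition, just to isolate encodability from the byte juggling):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeCheck {
    public static void main(String[] args) {
        CharsetEncoder win1252 = Charset.forName("windows-1252").newEncoder();
        // 193 = Á, 205 = Í, 207 = Ï -- all three are defined in windows-1252
        for (int i : new int[] { 193, 205, 207 }) {
            System.out.println(i + " encodable: " + win1252.canEncode((char) i));
        }
    }
}
```

All three print `true`, which is why the failures above are surprising at first glance.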

Can anyone explain this behavior? Is this a JDK bug?

-- James

+3  A: 

In my opinion the testing program is deeply flawed, because it performs String-to-String transformations that have no semantic meaning.

If you want to check if all byte values are valid values for a given encoding, then something like this might be more like it:

public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        byte[] bytes = new byte[] { (byte) i };

        String toString = new String(bytes, encoding);
        byte[] fromString = toString.getBytes(encoding);

        if (!Arrays.equals(bytes, fromString)) {
            System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes) + "/ out: "
                    + Arrays.toString(fromString) + " - result: " + toString);
            badCount++;
        }
    }

    System.out.println("Bad count: " + badCount);
}

Note that this testing program tests inputs using the (unsigned) byte values from 1 to 254. The code in the question uses the char values (equivalent to Unicode code points in this range) from 1 to 254.

Try printing the actual byte arrays handled by the program in the example, and you'll see that you're not actually checking all byte values, and that some of your "bad" matches are duplicates of others.

Running this with "Windows-1252" as the argument produces this output:

Can't encode: 129 - in: [-127]/ out: [63] - result: �
Can't encode: 141 - in: [-115]/ out: [63] - result: �
Can't encode: 143 - in: [-113]/ out: [63] - result: �
Can't encode: 144 - in: [-112]/ out: [63] - result: �
Can't encode: 157 - in: [-99]/ out: [63] - result: �
Bad count: 5

Which tells us that Windows-1252 doesn't accept the byte values 129, 141, 143, 144, and 157 as valid values. (Note: I'm talking about unsigned byte values here. The code above shows -127, -115, ... because Java only knows signed bytes.)
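The same five gaps can be found directly with a strict decoder; a sketch (my illustration, using CodingErrorAction.REPORT so that undefined bytes raise an exception instead of being silently replaced):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class FindUndefinedBytes {
    public static void main(String[] args) {
        CharsetDecoder strict = Charset.forName("windows-1252").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        for (int i = 0; i < 256; i++) {
            try {
                // decode(ByteBuffer) resets the decoder before each run
                strict.decode(ByteBuffer.wrap(new byte[] { (byte) i }));
            } catch (CharacterCodingException e) {
                System.out.println("undefined in windows-1252: 0x"
                        + Integer.toHexString(i).toUpperCase());
            }
        }
    }
}
```

This should report exactly the positions 81, 8D, 8F, 90, and 9D mentioned below.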

The Wikipedia article on Windows-1252 seems to verify this observation by stating this:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused

Joachim Sauer
Joachim, thanks for this test. Notice that characters 193, 205, and 207 are not in your output above. Why do they not encode properly in Windows-1252 when they do in Latin1? Those codes map to the same characters in both code pages.
James Cooper
@James: "Why are they not encoding properly in Windows-1252" is the wrong question. The character U+00C1 (codepoint 193) is represented as 0xC3 0x81 in UTF-8. When you try to interpret those bytes as Windows-1252, then you'll notice that 0x81 is not a valid value for Windows-1252 and will be replaced with a replacement character.
Joachim Sauer
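Joachim's explanation can be sketched in a few lines (my illustration, assuming the standard JDK charset names):

```java
import java.util.Arrays;

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // Á (U+00C1, code point 193) encodes in UTF-8 as two bytes:
        byte[] utf8 = "\u00C1".getBytes("UTF-8");
        System.out.println(Arrays.toString(utf8));   // [-61, -127] = 0xC3 0x81
        // 0xC3 decodes as 'Ã' in windows-1252, but 0x81 is undefined there,
        // so the decoder substitutes the replacement character U+FFFD:
        System.out.println(new String(utf8, "windows-1252"));
    }
}
```

So it is the second UTF-8 byte (0x81), not the character Á itself, that windows-1252 cannot represent.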
That makes sense. Thank you. I need to open a new question, as this one is confusing the issue. My apologies.
James Cooper
+2  A: 

What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding, and makes no sense at all (it's virtually guaranteed to lose data). Transcoding means byte[]->String->byte[]:

public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
        throws UnsupportedEncodingException
{
    return new String(input, inputEnc).getBytes(targetEnc);
}

And of course, it will lose data when the input contains characters that the target encoding does not support.
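To make the loss concrete, here is a quick sketch (my example; the euro sign exists in windows-1252 at 0x80 but not in ISO-8859-1, where getBytes falls back to '?'):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class TranscodeDemo {
    static byte[] transcode(byte[] input, String inputEnc, String targetEnc)
            throws UnsupportedEncodingException {
        return new String(input, inputEnc).getBytes(targetEnc);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "\u20AC".getBytes("UTF-8");  // '€' (U+20AC) in UTF-8
        // windows-1252 has the euro sign at 0x80:
        System.out.println(Arrays.toString(transcode(utf8, "UTF-8", "windows-1252"))); // [-128]
        // ISO-8859-1 has no euro sign; the encoder substitutes '?':
        System.out.println(Arrays.toString(transcode(utf8, "UTF-8", "ISO-8859-1")));   // [63]
    }
}
```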

Michael Borgwardt
Not sure how this differs from my example. Could you post an example that demonstrates that this actually transcodes between encodings? My tests indicate that code does exactly what mine does. If you have a byte array encoded in UTF-8, and pass in "Windows-1252" as the target encoding, you won't get back a properly encoded string -- you'll get gibberish. See my Charset transcode() implementation. I think that's what we're after.
James Cooper
@James it seems you harbor some misconceptions as to what Java strings are. They're *decoded* characters (using UTF-16 internally, but that is irrelevant here). You cannot decode a string. Byte arrays are decoded to Strings, and Strings are encoded to byte arrays. Transcoding starts and ends with byte arrays, because a byte array is a concrete, encoding-dependent representation of an abstract string.
Michael Borgwardt
@Michael. Thank you. I am maintaining an app where the String was created improperly upstream (in some DAO code, due to data stored improperly in MySQL). Raw bytes were UTF-8, but the String was created with Windows-1252. My goal was to take a Java string, which is all I have at this point, and somehow transmogrify it so it's not gibberish. I realize I'm not solving the root cause, etc., but 'tis our plight sometimes in maintenance engineering. Joachim's answer that 0x81 is not defined in Windows-1252 explains why I cannot recover that character.
James Cooper
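For the record, the usual repair in that situation is to reverse the upstream mistake: re-encode with the wrong charset, then re-decode with the right one. A sketch (my illustration; it works only when every UTF-8 byte happens to be defined in windows-1252, which is exactly why the five undefined bytes above are unrecoverable):

```java
public class MojibakeRepair {
    public static void main(String[] args) throws Exception {
        // Simulate the upstream bug: UTF-8 bytes wrongly decoded as windows-1252.
        String original = "na\u00EFve";  // "naïve"
        String garbled = new String(original.getBytes("UTF-8"), "windows-1252");
        // Reverse the mistake: re-encode as windows-1252, re-decode as UTF-8.
        String repaired = new String(garbled.getBytes("windows-1252"), "UTF-8");
        System.out.println(garbled);    // naÃ¯ve
        System.out.println(repaired);   // naïve
    }
}
```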
@James Ah, now I understand the problem; that's a rather nasty situation to fix.
Michael Borgwardt