views:

596

answers:

2

EDIT: I've been convinced that this question is somewhat non-sensical. Thanks to those who responded. I may post a follow-up question that is more specific.

Today I was investigating some encoding problems and wrote this unit test to isolate a minimal repro case:

int badCount = 0;
for (int i = 1; i < 255; i++) {
    String str = "Hi " + new String(new char[] { (char) i });

    String toLatin1  = new String(str.getBytes("UTF-8"), "latin1");
    assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

    String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
    String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

    if (!str.equals(fromWin1252)) {
        System.out.println("Can't encode: " + i + " - " + str + 
                           " - encodes as: " + fromWin1252);
        badCount++;
    }
}

System.out.println("Bad count: " + badCount);

The output:

    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically round-trips all 254 characters; Windows-1252 does not. The three printable characters (193, 205, and 207) occupy the same code positions in Latin1 and Windows-1252, so I wouldn't expect any issues with them.
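One way to confirm that those three characters are encodable at all, independent of the UTF-8 round trip, is a quick sketch with CharsetEncoder.canEncode (my addition, just to isolate encodability from the byte juggling):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeCheck {
    public static void main(String[] args) {
        CharsetEncoder win1252 = Charset.forName("windows-1252").newEncoder();
        // 193 = Á, 205 = Í, 207 = Ï -- all three are defined in windows-1252
        for (int i : new int[] { 193, 205, 207 }) {
            System.out.println(i + " encodable: " + win1252.canEncode((char) i));
        }
    }
}
```

All three print `true`, which is why the failures above are surprising at first glance.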

Can anyone explain this behavior? Is this a JDK bug?

-- James

+3  A: 

In my opinion the testing program is deeply flawed, because it performs String-to-String transformations that have no semantic meaning.

If you want to check if all byte values are valid values for a given encoding, then something like this might be more like it:

public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        byte[] bytes = new byte[] { (byte) i };

        String toString = new String(bytes, encoding);
        byte[] fromString = toString.getBytes(encoding);

        if (!Arrays.equals(bytes, fromString)) {
            System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes) + "/ out: "
                    + Arrays.toString(fromString) + " - result: " + toString);
            badCount++;
        }
    }

    System.out.println("Bad count: " + badCount);
}

Note that this testing program tests inputs using the (unsigned) byte values from 1 to 254. The code in the question uses the char values (equivalent to Unicode code points in this range) from 1 to 254.

Try printing the actual byte arrays handled by the program in the example, and you'll see that you're not actually checking all byte values, and that some of your "bad" matches are duplicates of others.

Running this with "Windows-1252" as the argument produces this output:

Can't encode: 129 - in: [-127]/ out: [63] - result: �
Can't encode: 141 - in: [-115]/ out: [63] - result: �
Can't encode: 143 - in: [-113]/ out: [63] - result: �
Can't encode: 144 - in: [-112]/ out: [63] - result: �
Can't encode: 157 - in: [-99]/ out: [63] - result: �
Bad count: 5

Which tells us that Windows-1252 doesn't accept the byte values 129, 141, 143, 144, and 157 as valid values. (Note: I'm talking about unsigned byte values here. The code above shows -127, -115, ... because Java only knows signed bytes.)
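The same five gaps can be found directly with a strict decoder; a sketch (my illustration, using CodingErrorAction.REPORT so that undefined bytes raise an exception instead of being silently replaced):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class FindUndefinedBytes {
    public static void main(String[] args) {
        CharsetDecoder strict = Charset.forName("windows-1252").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        for (int i = 0; i < 256; i++) {
            try {
                // decode(ByteBuffer) resets the decoder before each run
                strict.decode(ByteBuffer.wrap(new byte[] { (byte) i }));
            } catch (CharacterCodingException e) {
                System.out.println("undefined in windows-1252: 0x"
                        + Integer.toHexString(i).toUpperCase());
            }
        }
    }
}
```

This should report exactly the positions 81, 8D, 8F, 90, and 9D mentioned below.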

The Wikipedia article on Windows-1252 seems to verify this observation by stating this:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused

Joachim Sauer
Joachim, thanks for this test. Notice that characters 193, 205, and 207 are not in your output above. Why do they not encode properly in Windows-1252 when they do in Latin1? Those codes map to the same characters in both code pages.
James Cooper
@James: "Why are they not encoding properly in Windows-1252" is the wrong question. The character U+00C1 (codepoint 193) is represented as 0xC3 0x81 in UTF-8. When you try to interpret those bytes as Windows-1252, then you'll notice that 0x81 is not a valid value for Windows-1252 and will be replaced with a replacement character.
Joachim Sauer
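Joachim's explanation can be sketched in a few lines (my illustration, assuming the standard JDK charset names):

```java
import java.util.Arrays;

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // Á (U+00C1, code point 193) encodes in UTF-8 as two bytes:
        byte[] utf8 = "\u00C1".getBytes("UTF-8");
        System.out.println(Arrays.toString(utf8));   // [-61, -127] = 0xC3 0x81
        // 0xC3 decodes as 'Ã' in windows-1252, but 0x81 is undefined there,
        // so the decoder substitutes the replacement character U+FFFD:
        System.out.println(new String(utf8, "windows-1252"));
    }
}
```

So it is the second UTF-8 byte (0x81), not the character Á itself, that windows-1252 cannot represent.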
That makes sense. Thank you. I need to open a new question, as this one is confusing the issue. My apologies.
James Cooper
+2  A: 

What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding, and makes no sense at all (it's virtually guaranteed to lose data). Transcoding means byte[]->String->byte[]:

public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
        throws UnsupportedEncodingException
{
    return new String(input, inputEnc).getBytes(targetEnc);
}

And of course, it will lose data when the input contains characters that the target encoding does not support.
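To make the loss concrete, here is a quick sketch (my example; the euro sign exists in windows-1252 at 0x80 but not in ISO-8859-1, where getBytes falls back to '?'):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class TranscodeDemo {
    static byte[] transcode(byte[] input, String inputEnc, String targetEnc)
            throws UnsupportedEncodingException {
        return new String(input, inputEnc).getBytes(targetEnc);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "\u20AC".getBytes("UTF-8");  // '€' (U+20AC) in UTF-8
        // windows-1252 has the euro sign at 0x80:
        System.out.println(Arrays.toString(transcode(utf8, "UTF-8", "windows-1252"))); // [-128]
        // ISO-8859-1 has no euro sign; the encoder substitutes '?':
        System.out.println(Arrays.toString(transcode(utf8, "UTF-8", "ISO-8859-1")));   // [63]
    }
}
```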

Michael Borgwardt
Not sure how this differs from my example. Could you post an example that demonstrates that this actually transcodes between encodings? My tests indicate that code does exactly what mine does. If you have a byte array encoded in UTF-8, and pass in "Windows-1252" as the target encoding, you won't get back a properly encoded string -- you'll get gibberish. See my Charset transcode() implementation. I think that's what we're after.
James Cooper
@James it seems you harbor some misconceptions as to what Java strings are. They're *decoded* characters (using UTF-16 internally, but that is irrelevant here). You cannot decode a string. Byte arrays are decoded to Strings, and Strings are encoded to byte arrays. Transcoding starts and ends with byte arrays, because a byte array is a concrete, encoding-dependent representation of an abstract string.
Michael Borgwardt
@Michael. Thank you. I am maintaining an app where the String was created improperly upstream (in some DAO code, due to data stored improperly in MySQL). Raw bytes were UTF-8, but the String was created with Windows-1252. My goal was to take a Java string, which is all I have at this point, and somehow transmogrify it so it's not gibberish. I realize I'm not solving the root cause, etc., but 'tis our plight sometimes in maintenance engineering. Joachim's answer that 0x81 is not defined in Windows-1252 explains why I cannot recover that character.
James Cooper
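For the record, the usual repair in that situation is to reverse the upstream mistake: re-encode with the wrong charset, then re-decode with the right one. A sketch (my illustration; it works only when every UTF-8 byte happens to be defined in windows-1252, which is exactly why the five undefined bytes above are unrecoverable):

```java
public class MojibakeRepair {
    public static void main(String[] args) throws Exception {
        // Simulate the upstream bug: UTF-8 bytes wrongly decoded as windows-1252.
        String original = "na\u00EFve";  // "naïve"
        String garbled = new String(original.getBytes("UTF-8"), "windows-1252");
        // Reverse the mistake: re-encode as windows-1252, re-decode as UTF-8.
        String repaired = new String(garbled.getBytes("windows-1252"), "UTF-8");
        System.out.println(garbled);    // naÃ¯ve
        System.out.println(repaired);   // naïve
    }
}
```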
@James Ah, now I understand the problem; that's a rather nasty situation to fix.
Michael Borgwardt