EDIT: I've been convinced that this question is somewhat nonsensical. Thanks to those who responded. I may post a more specific follow-up question.
Today I was investigating some encoding problems and wrote this unit test to isolate a minimal repro case:
    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        String str = "Hi " + new String(new char[] { (char) i });
        String toLatin1 = new String(str.getBytes("UTF-8"), "latin1");
        assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));
        String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
        String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");
        if (!str.equals(fromWin1252)) {
            System.out.println("Can't encode: " + i + " - " + str +
                    " - encodes as: " + fromWin1252);
            badCount++;
        }
    }
    System.out.println("Bad count: " + badCount);
The output:
    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10
JDK 1.6.0_07 on Mac OS 10.6.2
My observation:
Latin1 round-trips all 254 characters; Windows-1252 does not. The three printable characters (193, 205, 207) have the same codes in Latin1 and Windows-1252, so I wouldn't expect any issues.
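One way to narrow this down is to look at the UTF-8 bytes themselves, since those are what actually get decoded as Windows-1252, not the character codes. For example, 193 is encoded in UTF-8 as the two bytes C3 81. Below is a minimal diagnostic sketch, not a confirmed explanation: the class name `RoundTripProbe` and the helpers `utf8Hex` and `roundTrips` are hypothetical names of my own, and it reuses the same round trip as the test above.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic helper; the names are illustrative only.
class RoundTripProbe {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Hex dump of the UTF-8 bytes for a single code point, e.g. 193 -> "C3 81".
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(UTF8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Same round trip as the unit test: take the UTF-8 bytes, decode them as
    // `via`, re-encode as `via`, then decode the result as UTF-8 again.
    static boolean roundTrips(String s, Charset via) {
        String decoded = new String(s.getBytes(UTF8), via);
        String back = new String(decoded.getBytes(via), UTF8);
        return s.equals(back);
    }

    public static void main(String[] args) {
        Charset win1252 = Charset.forName("windows-1252");
        for (int i = 1; i < 255; i++) {
            String s = new String(Character.toChars(i));
            if (!roundTrips(s, win1252)) {
                // Show which raw bytes were handed to the Windows-1252 decoder.
                System.out.println(i + " -> UTF-8 bytes: " + utf8Hex(i));
            }
        }
    }
}
```

If the failing characters turn out to share the same trailing bytes, that would suggest the problem lies with particular byte values in Windows-1252 rather than with the characters 193, 205, and 207 themselves.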
Can anyone explain this behavior? Is this a JDK bug?
-- James