I recently realized that I don't fully understand Java's string encoding process.
Consider the following code:
public class Main
{
    public static void main(String[] args)
    {
        System.out.println(java.nio.charset.Charset.defaultCharset().name());
        System.out.println("ack char: ^"); /* where ^ = 0x06, the ack char */
    }
}
Since control characters are interpreted differently between windows-1252 and ISO-8859-1, I chose the ACK character for testing.
I then compiled it with three different file encodings: UTF-8, windows-1252, and ISO-8859-1. All three compile to exactly the same class file, byte for byte, as verified by md5sum.
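For reference, the compiles looked along these lines (assuming the source file is named Main.java):

$ javac -encoding UTF-8 Main.java && md5sum Main.class
$ javac -encoding windows-1252 Main.java && md5sum Main.class
$ javac -encoding ISO-8859-1 Main.java && md5sum Main.class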
I then run the program:
$ java Main | hexdump -C
00000000 55 54 46 2d 38 0a 61 63 6b 20 63 68 61 72 3a 20 |UTF-8.ack char: |
00000010 06 0a |..|
00000012
$ java -Dfile.encoding=iso-8859-1 Main | hexdump -C
00000000 49 53 4f 2d 38 38 35 39 2d 31 0a 61 63 6b 20 63 |ISO-8859-1.ack c|
00000010 68 61 72 3a 20 06 0a |har: ..|
00000017
$ java -Dfile.encoding=windows-1252 Main | hexdump -C
00000000 77 69 6e 64 6f 77 73 2d 31 32 35 32 0a 61 63 6b |windows-1252.ack|
00000010 20 63 68 61 72 3a 20 06 0a | char: ..|
00000019
It outputs the same 0x06 byte no matter which encoding is in effect, a byte that windows-1252 code pages would interpret as the printable [ACK] character.
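As a cross-check on the class file itself, the constant pool can be dumped with javap; if I read the class-file format correctly, string literals are stored there as (modified) UTF-8 regardless of the -encoding used at compile time:

$ javap -v Main | grep 'ack char'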
That leads me to a few questions:
- Is the codepage / charset of the Java file being compiled expected to be identical to the default charset of the system under which it's being compiled? Are the two always synonymous?
- The compiled representation doesn't seem to depend on the compile-time charset; is this indeed the case?
- Does this imply that strings within Java files may be interpreted differently at runtime if they contain characters that aren't representable in the current charset/locale? (See the sketch after this list.)
- What else should I really know about string and character encoding in Java?
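To probe the second and third questions, here is a minimal sketch of my current mental model (the class name EncodingSketch and the explicitly wrapped PrintStream are just for illustration): strings live in the JVM as UTF-16 code units, and the default charset should only matter at the moment bytes are actually produced.

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingSketch
{
    public static void main(String[] args) throws Exception
    {
        // \u0006 is the same ACK literal as in the original program.
        String s = "ack char: \u0006";

        // A Java String is a sequence of UTF-16 code units in memory;
        // concrete bytes exist only once the string is encoded.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);

        // 0x06 happens to encode to the same single byte either way:
        System.out.printf("UTF-8 last byte:      %02x%n", utf8[utf8.length - 1]);
        System.out.printf("ISO-8859-1 last byte: %02x%n", latin1[latin1.length - 1]);

        // Wrapping System.out with an explicit charset removes the
        // dependency on file.encoding for this stream:
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println(s);
    }
}

If that model holds, the last byte should print as 06 for both encodings, matching the hexdump output above.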