Why is the following displayed differently on Linux vs. Windows?

System.out.println(new String("¿".getBytes("UTF-8"), "UTF-8"));

in Windows:

¿

in Linux:

Â¿

A: 

Because of your editor and compiler encodings, it's hard to know exactly which bytes your source file contains, or which string getBytes() is actually being called on.

Can you produce a short but complete program containing only ASCII (and the relevant \uxxxx escaping in the string) which still shows the problem?

I suspect the problem may well be with the console output on either Windows or Linux, but it would be good to get a reproducible program first.

Jon Skeet
+7  A: 

Not sure exactly where the problem is, but it's worth noting that

Â¿ (0xc2, 0xbf)

is the result of UTF-8 encoding

0xbf,

which is the Unicode code point for ¿.

So it looks like, in the Linux case, the output is UTF-8 but is being displayed as if it were a single-byte encoding.
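
The mismatch described above can be reproduced without involving a console at all. This is only a minimal sketch, assuming the terminal decodes as ISO-8859-1 (the actual single-byte encoding on the asker's machine is not known); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then decode the same bytes as ISO-8859-1,
    // which is roughly what a single-byte terminal would do with them.
    // (ISO-8859-1 is an assumption, not something the question confirms.)
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String result = misdecode("\u00bf");
        // Print code points in hex so the program's own output is ASCII-only
        for (int i = 0; i < result.length(); i++) {
            System.out.printf("U+%04X%n", (int) result.charAt(i));
        }
        // prints U+00C2 then U+00BF, i.e. the two characters "Â¿"
    }
}
```

The source uses the \u00bf escape rather than a literal ¿, so the result does not depend on how the source file itself was saved or compiled.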

Juan Pablo Califano
+1. I recently saw a non-breaking space (U+00A0) being displayed as C2A0 and went wtf.
Amarghosh
+5  A: 

Check which encoding your Linux terminal is using.

For gnome-terminal on Ubuntu, go to the "Terminal" menu and select "Set Character Encoding".

For PuTTY, go to Configuration -> Window -> Translation and select UTF-8 (and if that doesn't work, see this post).
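
It can also help to see what the Java side believes the encoding is, to compare against the terminal setting. The following is only a sketch: it reports the JVM's default charset and the locale environment variables it can see, and it cannot reveal what the terminal emulator itself is configured to decode. The class name EncodingReport is made up for illustration.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Collects the encoding-related settings visible to the JVM.
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("default charset: ").append(Charset.defaultCharset().name()).append('\n');
        sb.append("file.encoding:   ").append(System.getProperty("file.encoding")).append('\n');
        // LANG and LC_ALL are usually set on Linux; they may be null elsewhere
        sb.append("LANG:            ").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL:          ").append(System.getenv("LC_ALL")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If the charset reported here differs from the one selected in the terminal, non-ASCII output is likely to be garbled.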

Hamish Downer
+1  A: 

Run this code to help determine if it is a compiler or console issue:

import java.nio.charset.Charset;

public class EncodingCheck {
  public static void main(String[] args) {
    String s = "¿"; // deliberately a literal, to test how javac reads the source file
    printHex(Charset.defaultCharset(), s);

    Charset utf8 = Charset.forName("UTF-8");
    printHex(utf8, s);
  }

  // Prints the charset name, the string, and the string's bytes in that charset as hex.
  public static void printHex(Charset encoding, String s) {
    System.out.print(encoding + "\t" + s + "\t");

    byte[] barr = s.getBytes(encoding);
    for (int i = 0; i < barr.length; i++) {
      int n = barr[i] & 0xFF; // mask so the byte is treated as unsigned
      String hex = Integer.toHexString(n);
      if (hex.length() == 1) {
        System.out.print('0'); // pad to two hex digits
      }
      System.out.print(hex);
    }
    System.out.println();
  }
}

If the encoded bytes for UTF-8 are different on each platform (they should be c2bf), it is a compiler issue. If they are the same, the problem is in how the console displays the output.

If it is a compiler issue, replace "¿" with "\u00bf".

McDowell
+12  A: 

System.out.println() outputs the text in the system default encoding, but the console interprets that output according to its own encoding (or "codepage") setting. On your Windows machine the two encodings seem to match, but on the Linux box the output is apparently in UTF-8 while the console is decoding it as a single-byte encoding like ISO-8859-1. Or maybe, as Jon suggested, the source file is being saved as UTF-8 and javac is reading it as something else, a problem that can be avoided by using Unicode escapes.
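
To illustrate the point about System.out's encoding, here is a minimal sketch that captures the bytes a PrintStream emits for the same character under two encodings. It writes to an in-memory buffer instead of the real console, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamBytes {
    // Returns, as hex, the bytes a PrintStream writes for s in the given encoding.
    static String printedHex(String s, String encoding) throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, encoding);
        out.print(s);
        out.flush();
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(printedHex("\u00bf", "UTF-8"));       // c2bf
        System.out.println(printedHex("\u00bf", "ISO-8859-1"));  // bf
    }
}
```

The same three-argument constructor can wrap System.out itself, for example new PrintStream(System.out, true, "UTF-8"), but the result will only look right if the console is also decoding UTF-8.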

When you need to output anything other than ASCII text, your best bet is to write it to a file using an appropriate encoding, then read the file with a text editor--consoles are too limited and too system-dependent. By the way, this bit of code:

new String("¿".getBytes("UTF-8"), "UTF-8")

...has no effect on the output. All that does is encode the contents of the string to a byte array and decode it again, reproducing the original string--an expensive no-op. If you want to output text in a particular encoding, you need to use an OutputStreamWriter, like so:

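FileOutputStream fos = new FileOutputStream("out.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
osw.write("\u00bf"); // stored in the file as the UTF-8 bytes c2 bf
osw.close();         // flushes the writer and closes the underlying stream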
Alan Moore