ansaurus

Question

Answer 1

+7 A:

The console you're writing to (e.g. the Windows console) may not support Unicode and may mangle the characters, so console output is not a reliable representation of the data.

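Try writing the output to a file instead, making sure the encoding is set explicitly on the writer (FileWriter used the platform default encoding before Java 11, so wrap a FileOutputStream in an OutputStreamWriter with the charset you want), then open the file in a Unicode-friendly editor.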

Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
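The file-based check suggested above could look roughly like this. It is only a sketch: the class name WriteUtf8File, the helper writeUtf8 and the file name output.txt are made up for illustration, and it assumes Java 7 or later for try-with-resources and java.nio.file.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteUtf8File {
    // Writes text to the given path as UTF-8, independent of the platform default encoding.
    static void writeUtf8(Path path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "i\u0308"; // U+0069 followed by U+0308 (combining diaeresis)
        Path path = Paths.get("output.txt"); // hypothetical file name
        writeUtf8(path, text);
        // Read the raw bytes back to confirm what actually ended up in the file.
        for (byte b : Files.readAllBytes(path)) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
        System.out.println();
    }
}
```

If the file contains the bytes 69 cc 88, the data is fine and only the console rendering is at fault.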

skaffman 2009-08-13 13:43:02

+1: On Ubuntu 9.04 in a terminal (gnome-terminal) the output is the i with diaeresis, as you probably expect.

Joachim Sauer 2009-08-13 13:48:10

I'm liking this word "diaeresis". I may have to use it more often in conversation.

skaffman 2009-08-13 13:49:56

:) Try "umlaut" as well, and you'll be the man of the evening.

Eric Nicolas 2009-08-13 13:55:56

Answer 2

+3 A:

The code is fine, but as skaffman said, your console probably doesn't support the appropriate character.

To test for sure, you need to print out the Unicode values of the characters:

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        for (int i=0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)));
        }
    }
}

This prints 69, 308 - which is correct (U+0069, U+0308).

Jon Skeet 2009-08-13 13:51:31

Answer 3

+1 A:

You are both right. Thanks !!

Here is how I finally solved the problem, in Eclipse on Windows:

In the Run Configuration, Arguments tab, I added "-Dfile.encoding=UTF-8" to the VM arguments.
In the Run Configuration, Common tab, I set the Console Encoding to UTF-8.

And I modified the code as follows:

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");

PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
sysout.print(">" + new String(utf8, "UTF-8"));

Output:

{{69cc88}}
> ï

Thanks !

Eric Nicolas 2009-08-13 14:23:09

You should not need the "-Dfile.encoding=UTF-8" switch if you are going to encode the data yourself using a PrintStream. (Manually setting the "file.encoding" property may be problematic for any code that needs to know the system encoding.)
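To illustrate the point about the switch, here is a small sketch that captures what a PrintStream with an explicit charset writes, using a ByteArrayOutputStream instead of System.out. The class name ExplicitEncoding and the helper encodeToHex are made up for this example; the bytes depend only on the charset name passed in, not on file.encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitEncoding {
    // Sends text through a PrintStream with an explicit charset and returns the bytes as hex.
    static String encodeToHex(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, charsetName);
        out.print(text);
        out.flush();
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+0069 followed by U+0308, encoded as UTF-8 regardless of the default charset.
        System.out.println(encodeToHex("i\u0308", "UTF-8")); // prints 69cc88
    }
}
```

Whether the console then displays those bytes correctly is a separate matter, which is what the Console Encoding setting in Eclipse takes care of.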

McDowell 2009-08-13 14:41:39

Answer 4

+1 A:

Java, not unreasonably, encodes Unicode characters into native system encoded bytes before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.

Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" codepages and if you open the default command prompt (cmd.exe) it will be one of the old "OEM" DOS codepages (though it is possible to get ANSI and Unicode there with a bit of work).

Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
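You can check this without a Windows console. The sketch below asks the charset whether it can represent each character and then encodes the combining sequence; the class name EncodableCheck and its helpers are made up for illustration, and it assumes your JRE ships the windows-1252 charset (only a handful of charsets such as UTF-8 and ISO-8859-1 are guaranteed).

```java
import java.nio.charset.Charset;

public class EncodableCheck {
    // Returns true if the named charset can represent the given character.
    static boolean canEncode(String charsetName, char ch) {
        return Charset.forName(charsetName).newEncoder().canEncode(ch);
    }

    // Encodes text with the named charset and returns the bytes as space-separated hex.
    static String toHex(String charsetName, String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        String cp = "windows-1252"; // assumption: this charset is available in the JRE
        System.out.println(canEncode(cp, '\u0308')); // combining diaeresis: false
        System.out.println(canEncode(cp, '\u00EF')); // precomposed i with diaeresis: true
        System.out.println(toHex(cp, "i\u0308"));    // 69 3f (0x3f is the question mark)
    }
}
```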

An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:

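import java.io.IOException;
import java.text.Normalizer;

public class NormalizeDemo {
  // Prints the string as-is, then the hex value of each UTF-16 code unit.
  public static void emit(String foo) throws IOException {
    System.out.println("Literal: " + foo);
    System.out.print("Hex: ");
    for (char ch : foo.toCharArray()) {
      System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
    }
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    String foo = "\u0069\u0308"; // i followed by combining diaeresis
    emit(foo);
    // NFC composes the pair into the single character U+00EF.
    foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
    emit(foo);
  }
}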

Under windows-1252, this code will emit:

Literal: i?
Hex: 69 308 
Literal: ï
Hex: ef

McDowell 2009-08-13 15:33:27

Java UTF-8 strange behaviour
