Hello,

I am trying to decode some UTF-8 strings in Java. These strings contain some combining Unicode characters, such as CC 88 (combining diaeresis). The byte sequence seems correct, according to http://www.fileformat.info/info/unicode/char/0308/index.htm

But the output after conversion to a String looks wrong. Any ideas?

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));

Output:

    {{69cc88}}
    >i?
+7  A: 

The console you're writing to (e.g. on Windows) may not support Unicode, and may mangle the characters. The console output is not a good representation of the data.

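Try writing the output to a file instead, making sure the encoding is set explicitly (FileWriter has traditionally used the platform default encoding, so wrap a FileOutputStream in an OutputStreamWriter created with "UTF-8"), then open the file in a Unicode-friendly editor.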

Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
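The file-based check suggested above could look roughly like the following. This is only a minimal sketch: the class name WriteUtf8File, the helper writeDecoded and the file name out.txt are made up for illustration and are not from the question.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WriteUtf8File {
    // Decodes the bytes as UTF-8 and writes the text back out to a file,
    // naming the encoding explicitly instead of relying on the platform default.
    static void writeDecoded(byte[] utf8, Path target) throws IOException {
        String text = new String(utf8, StandardCharsets.UTF_8);
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(target.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = { 105, -52, -120 };
        Path target = Paths.get("out.txt"); // hypothetical file name
        writeDecoded(utf8, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

If the decoding is correct, a hex dump of the file should show the same three bytes as the input (69 cc 88), whatever the console displays.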

skaffman
+1: On Ubuntu 9.04 in a terminal (gnome-terminal) the output is the i with diaeresis, as you probably expect.
Joachim Sauer
I'm liking this word "diaeresis". I may have to use it more often in conversation.
skaffman
:) Try "umlaut" as well, and you'll be the man of the evening.
Eric Nicolas
+3  A: 

The code is fine, but as skaffman said your console probably doesn't support the appropriate character.

To test for sure, you need to print out the Unicode values of the characters:

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        for (int i=0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)));
        }
    }
}

This prints 69, 308 - which is correct (U+0069, U+0308).

Jon Skeet
+1  A: 

You are both right. Thanks !!

Here is how I finally solved the problem, in Eclipse on Windows:

  • In the Run Configuration, Arguments tab, I added "-Dfile.encoding=UTF-8" to the VM arguments
  • In the Run Configuration, Common tab, I set the Console Encoding to UTF-8

And I modified the code as follows:

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");

PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
sysout.print(">" + new String(utf8, "UTF-8"));

Output:

{{69cc88}}
> ï
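
To check what a UTF-8 PrintStream really emits without depending on any console, the same wrapper can be pointed at an in-memory buffer. This is just a sketch; the class name PrintStreamBytes and the helper encodeViaPrintStream are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamBytes {
    // Returns the bytes that a UTF-8 PrintStream writes for the given text.
    static byte[] encodeViaPrintStream(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.print(text);
        out.flush(); // print() without a newline is not guaranteed to auto-flush
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeViaPrintStream(text)) {
            hex.append(Integer.toHexString(b & 0xFF));
        }
        System.out.println(hex); // prints 69cc88
    }
}
```

The stream reproduces the original three bytes, so anything that still looks wrong on screen comes from how the console interprets those bytes.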

Thanks !

Eric Nicolas
You should not need the "-Dfile.encoding=UTF-8" switch if you are going to encode the data yourself using a PrintStream. (Manually setting the "file.encoding" property may be problematic for any code that needs to know the system encoding.)
McDowell
+1  A: 

Java, not unreasonably, encodes Unicode characters into native system encoded bytes before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.

Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" code pages, while the default command prompt (cmd.exe) uses one of the old "OEM" DOS code pages (though it is possible to get ANSI and Unicode there with a bit of work).

Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
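
The substitution can be reproduced on any platform by encoding the string by hand. The sketch below assumes the JRE ships the "windows-1252" charset (common, but not one of the charsets every Java runtime is required to provide); the class name Cp1252Demo and its helpers are made up for illustration.

```java
import java.nio.charset.Charset;

class Cp1252Demo {
    // Encodes text as windows-1252. String.getBytes(Charset) replaces
    // unmappable characters with the charset's replacement byte.
    static byte[] encode(String text) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        return text.getBytes(cp1252);
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("i\u0308"))); // combining sequence
        System.out.println(hex(encode("\u00EF")));  // precomposed character
    }
}
```

I would expect the first line to be 69 3f (0x3f is the question mark) and the second to be ef, since U+00EF has a slot in windows-1252 and U+0308 does not.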

An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:

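  import java.io.IOException;
  import java.text.Normalizer;

  // Class name is arbitrary; it only makes the snippet compile on its own.
  public class NormalizeDemo {
    // Prints the string as-is, then the hex value of each UTF-16 code unit.
    public static void emit(String foo) throws IOException {
      System.out.println("Literal: " + foo);
      System.out.print("Hex: ");
      for (char ch : foo.toCharArray()) {
        System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
      }
      System.out.println();
    }

    public static void main(String[] args) throws IOException {
      String foo = "\u0069\u0308";
      emit(foo);
      // NFC composes U+0069 U+0308 into the single character U+00EF.
      foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
      emit(foo);
    }
  }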

Under windows-1252, this code will emit:

Literal: i?
Hex: 69 308 
Literal: ï
Hex: ef 
McDowell