views: 337
answers: 5
Using Java 6 to get 8-bit characters from a String:

System.out.println(Arrays.toString("öä".getBytes("ISO-8859-1")));

gives me, on Linux: [-10, -28], but on OS X I get: [63, 63, 63, -89]

I seem to get the same result when using the fancy new nio CharsetEncoder class. What am I doing wrong? Or is it Apple's fault? :)
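For reference, here is a minimal self-contained check (the class name is made up) that sidesteps source-file encoding entirely by spelling the literal with \u escapes:

```java
import java.util.Arrays;

public class LiteralCheck {
    public static void main(String[] args) throws Exception {
        // "öä" spelled with escapes, so the bytes below cannot be
        // affected by how the compiler decodes this source file.
        String s = "\u00F6\u00E4";
        System.out.println(Arrays.toString(s.getBytes("ISO-8859-1")));
        // prints [-10, -28] on any platform
    }
}
```

If this prints [-10, -28] on both machines while the version with a literal "öä" does not, the problem is in how the source file is decoded, not in getBytes.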

+1  A: 

Maybe the character set for the source is not set (and thus different according to system locale)?

Can you run the same compiled class on both systems (not re-compile)?

Thilo
That's most definitely the case. That code, compiled correctly, will produce the same output on all supported platforms.
Joachim Sauer
+2  A: 

What is the encoding of the source file? 63 is the code for '?', which means "character can't be converted to the specified encoding".

So my guess is that you copied the source file to the Mac and that the source file uses an encoding which the Mac java compiler doesn't expect. IIRC, OS X will expect the file to be UTF-8.

Aaron Digulla
A: 

Bear in mind that there's more than one way to represent characters. Mac OS X uses Unicode by default, so your string literal may actually not be represented by two bytes. You need to make sure that you load the string from the appropriate incoming character set; for example, by specifying the characters in the source with \u escapes.

AlBlue
For what it's worth, an accented character can be represented in two ways: as a single precomposed glyph (ö, U+00F6) or as a base character followed by a combining character (o + U+0308).
AlBlue
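To illustrate the two representations, here is a small sketch using java.text.Normalizer (a standard JDK class since Java 6; the class name below is made up) to collapse the decomposed form into a single code point:

```java
import java.text.Normalizer;
import java.util.Arrays;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        String decomposed = "o\u0308"; // 'o' + COMBINING DIAERESIS: two chars
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(decomposed.length()); // prints 2
        System.out.println(composed.length());   // prints 1 (U+00F6)
        // Only the composed form survives an ISO-8859-1 round trip:
        System.out.println(Arrays.toString(composed.getBytes("ISO-8859-1"))); // prints [-10]
    }
}
```

A decomposed ö cannot be encoded to ISO-8859-1 at all, which is where the 63 ('?') bytes come from.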
Well, in this case the Java file is generated code. Changing the way these characters are encoded in the literal is not feasible.
Chip Zero
+1  A: 

Your source file is producing "öä" using combining characters.

Look at this:

System.out.println(Arrays.toString("\u00F6\u00E4".getBytes("ISO-8859-1")));

This should print [-10, -28] as you expect (I don't like printing it this way, but that's not the point of your question), because here the Unicode code points are specified explicitly, carved in stone, and your text editor is not allowed to "play smart" by combining 'o' and 'a' with diacritical marks.

Typically, when you encounter such problems, you'll want two OS X Un*x commands to figure out what's going on under the hood: file and hexdump are very convenient in such cases.

You want to run them on your source file and you may want to run them on your class file.
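As a rough Java equivalent of that hexdump, here is a sketch (the class name is made up; "MacRoman" is a common JDK charset alias, though its availability can vary by JRE) that prints the bytes each encoding would put on disk for the same two characters:

```java
public class EncodingDump {
    public static void main(String[] args) throws Exception {
        String s = "\u00F6\u00E4"; // öä
        for (String cs : new String[] {"UTF-8", "ISO-8859-1", "MacRoman"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : s.getBytes(cs)) {
                hex.append(String.format("%02x ", b));
            }
            // UTF-8 prints "c3 b6 c3 a4" -- exactly the bytes hexdump
            // would show for these characters in a UTF-8 source file.
            System.out.println(cs + ": " + hex);
        }
    }
}
```

Matching hexdump's output against these byte patterns tells you which encoding your editor actually used when saving the file.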

Webinator
Useful little tools. So how come javac doesn't know that this is a UTF-8 file?
Chip Zero
+2  A: 

I managed to reproduce this problem by saving the source file as UTF-8, then telling the compiler it was really MacRoman:

javac -encoding MacRoman Test.java

I would have thought javac would default to UTF-8 on OSX, but maybe not. Or maybe you're using an IDE and it's defaulting to MacRoman. Whatever the case, you have to make it use UTF-8 instead.

Alan Moore
It seems MacRoman is the default encoding on my OSX system. The source file with this literal is encoded in UTF-8 and it incorrectly parses it as MacRoman. So how to fix this? Specifying -encoding UTF-8 doesn't seem like a good option. What if I have some good old ISO-8859-1 files in there?
Chip Zero
If some of your files are ISO-8859-1, you'll have to compile them separately anyway and specify *that* encoding. I suggest you always specify UTF-8, both for saving and compiling. If a MacRoman or ISO-8859-1 file sneaks in, you'll know about it when compilation fails; it's a lot harder to trick UTF-8 into accepting bogus data than it is most other encodings.
Alan Moore
I figured it would fall back to ISO-8859-1 if it couldn't read a file as UTF-8, but that doesn't seem to be the case on my Linux box: '-encoding utf-8' gives the same behavior there. I still don't feel entirely comfortable using this switch, but I realize I'll have to, to fix it on my OS X box and similar systems. I can't help but wonder if there isn't a global "fix" so my system won't explode the next time I run into a project that uses UTF-8 string literals and I don't have a unit test to catch the problem?
Chip Zero
The platform-default encoding is one of Java's dirty little secrets; it's *never* safe to rely on it. The global fix you're looking for is to always specify an encoding when you save and compile source files. And that encoding might as well be UTF-8, since it can handle every character known to Unicode and it's guaranteed to be supported on every platform.
Alan Moore
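A small sketch of that advice (the class name is made up; Charset.defaultCharset() and the String constructor used below are standard JDK APIs): never let the platform default decode your bytes.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        // Varies per platform and locale -- never rely on it:
        System.out.println(Charset.defaultCharset());

        byte[] utf8 = {(byte) 0xC3, (byte) 0xB6}; // UTF-8 bytes for ö
        // Risky: new String(utf8) would decode with the platform default.
        // Safe: name the encoding explicitly.
        String s = new String(utf8, "UTF-8");
        System.out.println((int) s.charAt(0)); // prints 246 (U+00F6)
    }
}
```

The same rule applies to FileReader/FileWriter, which have no charset parameter at all: prefer InputStreamReader/OutputStreamWriter with an explicit encoding.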