views:

462

answers:

4

Hi

I have a problem with turkish special characters on different machines. The following code:

String turkish = "ğüşçĞÜŞÇı";

String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");

System.out.println(test1);
System.out.println(test2);
System.out.println(test3);

On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (Printed with the Netbeans 6.7 console):

?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?

I don't get the problem.

+3  A: 

Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.

I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.

Now, what bigger problem are you trying to solve?

Jon Skeet
+1. It is a shame that so much of the Java stdlib has default arguments for the Encoding; there's almost never a good reason to rely on the default encoding, it just encourages horrible bugs and deployment issues.
bobince
+1  A: 

You may be dealing with different settings of the default encoding.

java -Dfile.encoding=utf-8

versus

java -Dfile.encoding=something else

Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.

As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.

Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().

bmargulies
A: 

If you are using AspectJ compiler do not forget to set it's encoding to UTF-8 too. I have struggled to find this for hours.

nimcap
+4  A: 
String test1 = new String(turkish.getBytes());

You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.

String test2 = new String(turkish.getBytes("UTF-8"));

Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.

String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");

Does precisely nothing.

To write Strings to stdout with a different encoding than the default encoding, you'd create a encoder something like new OutputStreamWriter(System.out, "cp1252") and send the string content to that.

However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!

The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.

Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.

Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.

bobince