views:

404

answers:

2

When using the Scala interpreter (i.e. running the command 'scala' on the command line), I am not able to print Unicode characters correctly. Of course a-z, A-Z, etc. are printed correctly, but for example € or ƒ is printed as a ?.

print(8364.toChar)

results in ? instead of €. Probably I'm doing something wrong. My terminal supports UTF-8 characters, and even when I pipe the output to a separate file and open it in a text editor, ? is displayed.

This is all happening on Mac OS X (Snow Leopard, 10.6.2) with Scala 2.8 (nightly build) and Java 1.6.0_17.

+1  A: 

OK, at least part, if not all, of your problem here is that 128 is not the Unicode code point for the Euro sign. 128 (or 0x80, since hex seems to be the norm) is U+0080 <control>, i.e. it is not a printable character, so it's not surprising your terminal has trouble printing it.

The Euro sign's code point is 0x20AC (8364 in decimal), and that appears to work for me (I'm on Linux, on a nightly of 2.8):

scala> print(0x20AC.toChar)
€

Another fun test is to print the Unicode snowman character:

scala> print(0x2603.toChar)
☃

128 as € is apparently an extended character from one of the Windows code pages.
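That claim is easy to verify from the JVM itself. As a quick sketch (the charset name "windows-1252" is the standard JVM alias for that code page), decoding the single byte 0x80 with Windows-1252 does produce the Euro sign:

```scala
import java.nio.charset.Charset

// Decode the single byte 0x80 using the Windows-1252 code page,
// where the Euro sign sits at position 0x80.
val cp1252 = Charset.forName("windows-1252")
val decoded = new String(Array(0x80.toByte), cp1252)

println(decoded)            // prints the Euro sign (terminal permitting)
println(decoded.head.toInt) // its real Unicode code point: 8364 (0x20AC)
```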

I got the other character you mentioned to work too:

scala> 'ƒ'.toInt
res8: Int = 402

scala> 402.toChar
res9: Char = ƒ
Calum
You're right about the wrong number for the Euro symbol. However, it still doesn't work for me: `scala> print(0x20AC.toChar)` prints ?. But if it works in your nightly, it is probably a problem with my system, or maybe it is fixed in newer Scala 2.8 builds. I will update and investigate further.
Martin Sturm
I checked this on today's nightly (2.8.0.r20300-b20091223020158) and `print(0x20AC.toChar)` prints a question mark, just like in all the other 2.8 versions I've got lying around.
p3t0r
I'm on OSX 10.6.2 by the way.
p3t0r
Ah, I see from another answer that it was a file.encoding issue. Sorry! Please see my comment on your accepted answer, Martin.
Calum
+3  A: 

I found the cause of the problem, and a solution to make it work as it should. As I already suspected after posting my question, reading Calum's answer, and hitting encoding issues on the Mac in another (Java) project, the cause of the problem is the default encoding used by Mac OS X. When you start the Scala interpreter, it uses the platform's default encoding. On Mac OS X this is MacRoman; on Windows it is probably CP1252. You can check this by typing the following command in the Scala interpreter:

scala> System.getProperty("file.encoding");
res3: java.lang.String = MacRoman

According to the scala help text, it is possible to pass Java properties using the -D option. However, this did not work for me. I ended up setting the environment variable

JAVA_OPTS="-Dfile.encoding=UTF-8"

After restarting scala, the previous command now gives the following result:

scala> System.getProperty("file.encoding")
res0: java.lang.String = UTF-8

Now, printing special characters works as expected:

print(0x20AC.toChar)               
€

So, it is not a bug in Scala, but an issue with default encodings. In my opinion, it would be better if UTF-8 were used by default on all platforms. Searching for whether this is being considered, I came across a discussion on the Scala mailing list about this issue. The first message proposes using UTF-8 by default on Mac OS X when file.encoding reports MacRoman, since UTF-8 is the default charset on Mac OS X (which leaves me wondering why file.encoding defaults to MacRoman; probably a holdover from the classic Mac OS before OS X was released?). I don't think this proposal will make it into Scala 2.8, since Martin Odersky wrote that it is probably best to keep things as they are in Java (i.e. honor the file.encoding property).

Martin Sturm
To quote Sun: _The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution._ (http://bugs.sun.com/view_bug.do?bug_id=4163515) So it isn't supported, might not work on all JVMs, and might have unintended side effects.
McDowell
One way to do this while avoiding the issue that McDowell flags is to wrap System.out (which still works as a raw OutputStream) in a PrintStream that uses the encoding you want, then print through that, e.g. `val myOut = new PrintStream(System.out, true, "UTF-8"); myOut.print(0x20AC.toChar)`. This should always work. I would edit this in, but I don't think I have the karma for that sort of thing.
Calum
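The wrapping trick above can be sketched as a small self-contained snippet. A ByteArrayOutputStream stands in for the console here only so the encoded bytes are inspectable; in the interpreter you would wrap System.out instead. (PrintStream's three-argument constructor, taking a stream, an autoFlush flag, and a charset name, has been in Java since 1.4.)

```scala
import java.io.{ByteArrayOutputStream, PrintStream}

// Wrap an OutputStream in a PrintStream with an explicit UTF-8 encoding,
// bypassing whatever file.encoding happens to be.
val sink = new ByteArrayOutputStream()
val utf8Out = new PrintStream(sink, true, "UTF-8")
utf8Out.print(0x20AC.toChar)

// U+20AC encodes in UTF-8 as the three bytes E2 82 AC.
val bytes = sink.toByteArray
println(bytes.map(b => "%02X".format(b & 0xFF)).mkString(" ")) // E2 82 AC
```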
@Calum - be interesting to see if that works on Macs; it doesn't work very well on Windows, but it may be a platform-specific issue: http://illegalargumentexception.blogspot.com/2009/04/i18n-unicode-at-windows-command-prompt.html#charsets_javaconsole
McDowell
McDowell: Ack, I only really use Linux. It's probable (although I've not checked) that the Windows console thinks in characters rather than bytes, so you can't do the sort of "anything goes" nonsense you can on other systems. On the other hand, "anything goes" is part of the reason we're here ;). Thanks a lot for the update. :)
Calum
Thanks, I was wondering too, and your trick works. For the record, on Windows you must use `set JAVA_OPTS=-Dfile.encoding=UTF-8` (no quotes). Redirect the output to a file, because cmd.exe uses OEM encoding for compatibility with MS-DOS, I think.
PhiLho
Reading the illegalargumentexception blog entry with interest. I found out that the chcp command can be used to change the character set of the Windows console, but somehow setting it to 65001 (UTF-8) breaks the standard way of running the Scala interpreter: "Take the claims of UTF-8 support in the Windows console with a pinch of salt. Batch files won't run if you set the console to this mode." So scala.bat does nothing in this mode!
PhiLho