I'm trying to understand the basics of practical programming around character encodings.

A few things to consider:

  • I know how to read a file whose encoding is different, and convert it to the console's encoding.
  • But when I try to convert literal strings that appear in source code, for some reason, it doesn't always work:
    • In IntelliJ's console for the Clojure language (its REPL, or interactive interpreter), it doesn't work at all. I haven't checked whether this particular console differs from IntelliJ's standard Java console.
    • In Apple's Terminal, it sometimes works fine, depending on the source file's encoding.
    • In Eclipse and Netbeans, it always works fine.

There are lots of resources for learning about Unicode and character encodings but, AFAIK, not many that offer practical usage guidelines. Some other questions here on StackOverflow have been useful, but none has been enough for what I'm trying to do.

UPDATE: I have greatly simplified this question after understanding how general the problems I was facing were. Originally, it was specifically targeted at the Java platform, with a code example in the Clojure language. To see these, have a look at the first version of this question.

+1  A: 

Besides the point that the code you show is not Java, I would recommend looking at ICU (http://site.icu-project.org/), the open source Unicode library that is available for Java and C++.

lothar
Wow, looks awesome, thank you! I don't know if it'll help me solve my problem, but I'll definitely experiment with ICU as soon as I've understood how to solve my issues without ICU.
Daniel Jomphe
+2  A: 

Your problem is related to how your IDE tells the Java compiler to interpret the source file's encoding. (Console output might be another problem; I don't know.)

If you run the javac program with no arguments, you get a help printout (excerpt below) that hints at how it works.

 -encoding <encoding>       Specify character encoding used by source files

Javac thus decodes the source file, literal strings and all, using that encoding, and (I think) stores the strings as UTF-8 in the bytecode. I'm sure the Clojure compiler has a similar option.
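To make the failure mode concrete, here is a minimal Java sketch of what happens when a tool reads source bytes with the wrong encoding. It does not call javac; it only simulates the decode step, and the class and method names (`SourceEncodingMismatch`, `misread`) are made up for illustration. The literal is written with `\u` escapes so the example itself does not depend on how this file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingMismatch {

    /** Simulates a tool that reads text saved in one encoding as if it were in another. */
    public static String misread(String text, Charset savedAs, Charset readAs) {
        byte[] onDisk = text.getBytes(savedAs); // what the editor wrote to the file
        return new String(onDisk, readAs);      // what the other tool believes it read
    }

    public static void main(String[] args) {
        String literal = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Saved as UTF-8 but read as ISO-8859-1: the two-byte sequence for the
        // accented e turns into two separate characters.
        String garbled = misread(literal, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(literal.length() + " chars became " + garbled.length());
        System.out.println(garbled.equals("caf\u00c3\u00a9")); // true

        // When both sides agree, the text survives unchanged.
        String intact = misread(literal, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(literal)); // true
    }
}
```

Passing the matching value to `-encoding` is what keeps the compiler on the "both sides agree" path.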

In Eclipse, the option that decides which encoding source files have is under General > Workspace > Text file encoding. On my Swedish Windows machine, the selected default was CP1252. (I don't care what's there, since I avoid using characters outside ASCII for exactly this reason.)

Hugo
I think it's a gotcha. I won't be able to confirm it until Monday, though. Java's compiler reads source files using the platform's default encoding. Clojure's uses UTF-8 instead. Good thread about it: http://groups.google.com/group/clojure/browse_thread/thread/1ebe3c8f342f3abe/d0497724d342e27f?lnk=raot
Daniel Jomphe
You were right on point. I changed the encoding of my source files and which encoding my IDE uses to read them, so that everything matches. It didn't solve my problems with one environment, but I now understand it's a console output-related issue that I may submit as a bug to its author.
Daniel Jomphe
(Namely, it looks like the specific console only supports ASCII characters, or some other undocumented encoding.) Thanks for your help.
Daniel Jomphe
OK, that's interesting. Isolating the sources of error is always good, I guess. Nice to be of help! :D
Hugo
+1  A: 

The -encoding option of javac tells the compiler what character encoding the source files use.

IDEs usually default to the platform character encoding, but can be set to use an encoding that you specify. Many go a step further and let you override the encoding for a single file.

If your editor or IDE uses something other than the platform default, and you then compile or edit the files with a different tool, you need to make sure both tools are explicitly set to the same encoding.

erickson
I now have a better picture of how this all flows, thank you. It should definitely help. So as a developer, I need to be consistent with my platform until compilation. From there, I need to be consistent with the user's platform. Right?
Daniel Jomphe
That's right; when you send output to the console (at runtime on the user's machine), you need to make sure that you are using the console's encoding, which is usually the platform default.
erickson
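For the runtime half of that exchange, here is a rough Java sketch of sending text to a stream with an explicitly chosen encoding. The names `ConsoleEncoding`, `printerFor` and `encode` are hypothetical. Falling back to `Charset.defaultCharset()` in `main` is only an assumption about what the console expects, as in the comment above; pass the real console encoding as the first argument if it differs.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncoding {

    /** Wraps a stream so that characters are encoded with the given charset. */
    public static PrintStream printerFor(OutputStream out, Charset charset) {
        try {
            return new PrintStream(out, true, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset instance that already exists.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the bytes that would reach the console for the given text. */
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream printer = printerFor(buffer, charset);
        printer.print(text);
        printer.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: without an argument, the console uses the platform default.
        Charset consoleCharset = args.length > 0
                ? Charset.forName(args[0])
                : Charset.defaultCharset();
        printerFor(System.out, consoleCharset).println("caf\u00e9");
    }
}
```

The `encode` helper is only there to make the effect visible: the same accented character becomes one byte in ISO-8859-1 and two bytes in UTF-8, so a console expecting one and receiving the other will show garbage.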
+2  A: 

As a record of the knowledge that is good to have in order to solve this kind of problem, here are some highlights:

  1. Verify the encoding of each file your program uses. This includes source files and data files, whether they are fetched locally or over a network.
    1. Make sure that what reads the source files knows their respective encodings:
      • If you use an IDE, verify which encoding it uses for its following settings:
        • IDE-wide encoding
        • Project-wide encoding
        • Module-wide encoding
        • And its file-specific encoding.
        • Of course, you'll probably want to standardize them all on a single encoding.
      • If you use any kind of build tool or compiler outside of an IDE, verify its settings.
    2. Make sure that what reads the data files knows their respective encodings. You'll use your programming language's features to decode each data file from its original encoding.
  2. Verify which encoding is needed by the consumers of every kind of character data your program produces. You'll use your programming language's features to encode everything as required:
    • User interface
    • Files created or modified by your program, including:
      • Network communications
      • Log files.
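
Steps 1.2 and 2 of the checklist above can be sketched in Java as follows: decode input with the encoding it was written in, then encode output with the encoding its consumer expects. This is only an illustration under simple assumptions (small files read fully into memory, hypothetical names `Transcode`, `transcode` and `transcodeFile`). Note that `new String(bytes, charset)` silently substitutes a replacement character for malformed input rather than failing.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {

    /** Decodes bytes from one encoding and re-encodes the text in another. */
    public static byte[] transcode(byte[] data, Charset from, Charset to) {
        String text = new String(data, from); // decode: bytes -> characters
        return text.getBytes(to);             // encode: characters -> bytes
    }

    /** Same conversion, applied to whole files. */
    public static void transcodeFile(Path source, Charset from, Path target, Charset to)
            throws IOException {
        Files.write(target, transcode(Files.readAllBytes(source), from, to));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("input", ".txt");
        Path out = Files.createTempFile("output", ".txt");

        // Pretend the data file arrived encoded as ISO-8859-1.
        Files.write(in, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));

        transcodeFile(in, StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);

        // Prints "4 bytes in, 5 bytes out": the accented e needs two bytes in UTF-8.
        System.out.println(Files.size(in) + " bytes in, " + Files.size(out) + " bytes out");
    }
}
```

Both charsets are passed explicitly, so the result does not change with the platform the program happens to run on.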

The following tips, contributed by other people, might prove highly useful:

  • Don't use the default platform encoding unless you're really, really sure you mean to.
  • Prefer formats that carry their own encoding information. XML is a good example: All valid XML files have a very clearly defined encoding; parsing them doesn't depend on the encoding being specified by some external means.
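
As a small illustration of the XML tip, the Java sketch below hands raw bytes to the JDK's standard DOM parser without naming an encoding anywhere in the code; the parser takes it from the XML declaration inside the document. The class and method names (`XmlSelfDescribingEncoding`, `rootText`) are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlSelfDescribingEncoding {

    /** Parses raw XML bytes and returns the text content of the root element. */
    public static String rootText(byte[] xmlBytes) {
        try {
            // No charset is passed here: the parser reads it from the declaration.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00e9</word>";
        String utf8Doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\u00e9</word>";

        // Different bytes on disk, same text after parsing.
        String fromLatin1 = rootText(latin1Doc.getBytes(StandardCharsets.ISO_8859_1));
        String fromUtf8 = rootText(utf8Doc.getBytes(StandardCharsets.UTF_8));

        System.out.println(fromLatin1.equals(fromUtf8)); // true
    }
}
```

The declaration has to be truthful, of course: a file that claims UTF-8 but was saved as ISO-8859-1 fails just like any other mismatch.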

See also the following learning resources:

And to widen the subject, see What Issues prevent Java applications from working on multiple platforms?.

Daniel Jomphe
I'd add "prefer formats that carry their own encoding information". XML is a good example: All valid XML files have a very clearly defined encoding and parsing them doesn't depend on the encoding being specified by some external means.
Joachim Sauer
Thanks saua; I'll edit accordingly. Also, I think you could yourself edit this answer; to anybody who wishes to do so, go ahead, even if it means making it a community answer. (Not used to this, wanted to make sure you feel free to do so if it makes sense.)
Daniel Jomphe