ansaurus

Question

How do you handle different character encodings?

Answer 1

+1 A:

Besides the point that the code you show is not Java, I would recommend looking at ICU (http://site.icu-project.org/), the open source Unicode library that is available for Java and C++.

lothar 2009-04-03 22:36:58

Wow, looks awesome, thank you! I don't know if it'll help me solve my problem, but I'll definitely experiment with ICU as soon as I've understood how to solve my issues without ICU.

Daniel Jomphe 2009-04-04 14:22:56

Answer 2

+2 A:

Your problem is related to how your IDE tells the Java compiler to interpret the source file's encoding. (Console output might be a separate problem; I don't know.)

If you run javac with no arguments, it prints a help message (excerpt below) that hints at how this works.

 -encoding <encoding>       Specify character encoding used by source files

Javac thus decodes the source file, string literals and all, using that encoding, and stores the strings in the bytecode as UTF-8, I think. I'm sure the Clojure compiler has a similar option.

In Eclipse, the option that sets the encoding of source files is under General > Workspace > Text file encoding. On my Swedish Windows machine, the selected default was CP1252. (I don't care what's there, since I avoid using characters outside ASCII for exactly this reason.)
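
To see why a mismatch between the editor's encoding and the compiler's matters, here is a minimal sketch. It encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1, which is roughly what happens when a file saved as UTF-8 is compiled with a single-byte encoding such as CP1252. The class and method names are made up for illustration. ISO-8859-1 stands in for CP1252 because every JVM is required to support it, and the two encodings agree on the bytes used here.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {
    // Hypothetical helper: encode text as UTF-8, then decode the bytes with the wrong charset.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "e" with an acute accent, written as an escape so this file's own encoding does not matter.
        String original = "\u00E9";
        String garbled = utf8ReadAsLatin1(original);
        System.out.println(original.length() + " char became " + garbled.length() + " chars");
        // Print code points rather than raw characters, since the console may not display them.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Telling the compiler the file's real encoding with the -encoding option quoted above (for example, `javac -encoding UTF-8` followed by the source file) should avoid this kind of garbling.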

Hugo 2009-04-03 22:57:22

I think it's a gotcha. I won't be able to confirm it until Monday, though. Java's compiler reads source files using the platform's default encoding. Clojure's uses UTF-8 instead. Good thread about it: http://groups.google.com/group/clojure/browse_thread/thread/1ebe3c8f342f3abe/d0497724d342e27f?lnk=raot

Daniel Jomphe 2009-04-04 14:54:47

You were right on point. I changed the encoding of my source files and which encoding my IDE uses to read them, so that everything matches. It didn't solve my problems with one environment, but I now understand it's a console output-related issue that I may submit as a bug to its author.

Daniel Jomphe 2009-04-06 18:55:43

(Namely, it looks like the specific console only supports ASCII characters, or some other undocumented encoding.) Thanks for your help.

Daniel Jomphe 2009-04-06 18:57:01

OK, that's interesting. Isolating the sources of error is always good, I guess. Nice to be of help! :D

Hugo 2009-04-07 19:07:04

Answer 3

+1 A:

The -encoding option of javac tells the compiler what character encoding the source files use.

IDEs usually default to the platform character encoding, but can be set to use an encoding that you specify. Then they go another step to let you override the encoding on a single file.

If your editor or IDE is using something other than the platform default, and you then compile or edit the files with a different tool, you need to make sure both tools have explicitly been set to the same encoding.
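
As a quick check of what "the platform default" actually is on a given machine, something like the following sketch can help. The class and method names are hypothetical. Note that on JDK 18 and later the default charset is UTF-8 unless it is overridden, so the answer depends on the JDK version as well as on the operating system.

```java
import java.nio.charset.Charset;

final class PlatformEncodingReport {
    // Describe the charset the JVM falls back to when no encoding is given explicitly.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding property: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Comparing this output with the encoding configured in the IDE and in the build tool is one way to spot a mismatch before it shows up as garbled string literals.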

erickson 2009-04-03 22:58:43

I now see the flow of all this better, thank you. It should definitely help. So as a developer, I need to be consistent with my platform until compilation. From there, I need to be consistent with the user's platform. Right?

Daniel Jomphe 2009-04-04 14:58:51

That's right; when you send output to the console (at runtime on the user's machine), you need to make sure that you are using the console's encoding, which is usually the platform default.
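
A minimal sketch of writing to the console with an explicit encoding, rather than relying on whatever System.out was set up with. The helper name is made up, the choice of UTF-8 in main is an assumption about the console, and the PrintStream constructor that takes a Charset requires Java 10 or later.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleEncoding {
    // Wrap a byte stream so that printed text is encoded with the given charset.
    // The second argument enables auto-flush on println.
    static PrintStream printerFor(OutputStream out, Charset charset) {
        return new PrintStream(out, true, charset);
    }

    public static void main(String[] args) {
        // Assumption: the console expects UTF-8. Substitute the console's real encoding.
        PrintStream console = printerFor(System.out, StandardCharsets.UTF_8);
        console.println("caf\u00E9");
    }
}
```

If the console only understands ASCII, as suspected elsewhere in this thread, no choice of charset will make accented characters display correctly there.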

erickson 2009-04-04 16:01:47

Answer 4

+2 A:

As a record of the knowledge that is good to have in order to solve this kind of problem, here are some highlights:

Verify the encoding of each file your program uses. This includes source files and data files, whether they are read locally or fetched over a network.
1. Make sure that whatever reads the source files knows their respective encodings:
  - If you use an IDE, check which encoding it uses at each of the following levels:
    - IDE-wide encoding
    - Project-wide encoding
    - Module-wide encoding
    - File-specific encoding
  - You will probably want to standardize all of these on a single encoding.
  - If you use any kind of build tool or compiler outside of an IDE, verify its settings as well.
2. Make sure that whatever reads the data files knows their respective encodings. Use your programming language's features to decode each data file from its original encoding.

Verify which encoding is needed by the consumers of every kind of character data your program produces. Use your programming language's features to encode each output accordingly:
- User interface
- Files created or modified by your program, including:
  - Network communications
  - Log files
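
The decode-on-input, encode-on-output idea in the checklist above can be sketched as follows. The class name, method name, and the charsets chosen in main are assumptions for illustration; the point is only that both the reader and the writer are given an explicit charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CharsetTranscoder {
    // Decode the input file with its known encoding and re-encode it for whoever consumes the output.
    static void transcode(Path in, Charset inCharset, Path out, Charset outCharset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, outCharset)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is an ISO-8859-1 file and the consumer of args[1] expects UTF-8.
        transcode(Paths.get(args[0]), StandardCharsets.ISO_8859_1,
                  Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

Passing the charset explicitly also avoids constructors such as `new FileReader(String)`, which fall back to the platform default.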

The following tips, contributed by other people, might prove highly useful:

- Don't use the default platform encoding unless you're really, really sure you mean to.
- Prefer formats that carry their own encoding information. XML is a good example: all valid XML files have a very clearly defined encoding, so parsing them doesn't depend on the encoding being specified by some external means.
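
To illustrate the XML tip, here is a small sketch using the JDK's built-in DOM parser. The class and method names are hypothetical. The document is handed to the parser as raw bytes with no charset argument; the parser picks up the encoding from the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class SelfDescribingXml {
    // Parse raw bytes and return the text of the root element.
    // No charset is passed in: the parser reads it from the XML declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00E9</word>";
        // Encode with the same charset the declaration names.
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        String text = rootText(latin1Bytes);
        System.out.println("decoded correctly: " + text.equals("caf\u00E9"));
    }
}
```

A plain text file offers no such hint, which is why its encoding has to be agreed on by some other means.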

See also the following learning resources:

- Jon Skeet's Debugging Unicode Problems article, with a few more technically-inclined tricks.
- How Jon Skeet applies his knowledge to Java.

And to widen the subject, see What Issues prevent Java applications from working on multiple platforms?.

Daniel Jomphe 2009-04-06 18:53:32

I'd add "prefer formats that carry their own encoding information". XML is a good example: All valid XML files have a very clearly defined encoding and parsing them doesn't depend on the encoding being specified by some external means.

Joachim Sauer 2009-04-06 21:02:21

Thanks saua; I'll edit accordingly. Also, I think you could yourself edit this answer; to anybody who wishes to do so, go ahead, even if it means making it a community answer. (Not used to this, wanted to make sure you feel free to do so if it makes sense.)

Daniel Jomphe 2009-04-07 13:25:35
