For instance, class files use a modified UTF-8 (MUTF-8, closely related to CESU-8), but internally Java first used UCS-2 and now uses UTF-16. The specification for valid Java source files says that a minimal conforming Java compiler only has to accept ASCII characters.

What's the reason for these choices? Wouldn't it make more sense to use the same encoding throughout the Java ecosystem?

+3  A: 

MUTF-8 for efficiency, UCS-2 for hysterical raisins. :)

In 1993, UCS-2 was Unicode; everyone thought 65536 Characters Ought To Be Enough For Everyone.

Later on, when it became clear that there are indeed an awful lot of languages in the world, it was too late (not to mention a terrible idea) to redefine `char` to be 32 bits, so instead a mostly backward-compatible choice was made.

In a way that's closely analogous to the relationship between ASCII and UTF-8, Java strings that don't stray outside the historical UCS-2 boundaries are bit-identical to their UTF-16 representation. It's only when you colour outside those lines that you have to start worrying about surrogates, etc.
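
Roughly, in code (a minimal sketch; the class name is just for illustration):

```java
// A BMP-only string: each char holds exactly one code point, so the old
// UCS-2 view and the UTF-16 view are the same bits.
public class Ucs2Demo {
    public static void main(String[] args) {
        String bmp = "caf\u00E9";                                  // all code points <= U+FFFF
        System.out.println(bmp.length());                          // 4
        System.out.println(bmp.codePointCount(0, bmp.length()));   // 4

        // Colour outside the lines and surrogates appear:
        String astral = "\uD834\uDD1E";                            // U+1D11E, a single code point
        System.out.println(astral.length());                            // 2 chars (a surrogate pair)
        System.out.println(astral.codePointCount(0, astral.length()));  // 1 code point
    }
}
```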

Alex Cruise
+2  A: 

It seems to be a common software development problem. Early code targets one standard, usually the simplest to implement at the time it was created, and later versions add support for newer/better/less common/more complex standards.

A minimal compiler probably only has to accept ASCII because that's what many common editors use. These editors may not be ideal for working with Java and are nowhere near a full IDE, but they are often adequate for tweaking one source file.

Java seems to have attempted to set the bar higher and handle Unicode character sets, but it also left that ASCII 'bailout' option in place. I'm sure there are notes from some committee meeting that explain why.

Freiheit
+3  A: 

ASCII for source files is because at the time it wasn't considered reasonable to expect people to have text editors with full Unicode support. Things have improved since, but they still aren't perfect. The whole \uXXXX thing in Java is essentially Java's equivalent of C's trigraphs. (When C was created, some keyboards didn't have curly braces, so you had to use trigraphs!)
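
For example (a minimal sketch, with made-up names): because the escapes are translated before the compiler tokenizes the file, they work in string literals and even in identifiers, so a pure-ASCII file can still spell any Unicode character.

```java
public class EscapeDemo {
    public static void main(String[] args) {
        String cafe = "caf\u00E9";   // "café" written using only ASCII
        int caf\u00E9 = 1;           // the escape is legal even inside an identifier
        System.out.println(cafe + " " + caf\u00E9);  // prints: café 1
    }
}
```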

At the time Java was created, the class file format used UTF-8 and the runtime used UCS-2. Unicode had less than 64k codepoints, so 16 bits was enough. Later, when additional "planes" were added to Unicode, UCS-2 was replaced with the (pretty much) compatible UTF-16, and UTF-8 was replaced with CESU-8 (hence "Compatibility Encoding Scheme...").
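
One way to see the difference (a rough sketch, with an illustrative class name): `DataOutputStream.writeUTF` emits the same modified UTF-8 that class-file constant pools use, so you can compare it with standard UTF-8 for a supplementary character.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "A\uD834\uDD1E";  // 'A' plus U+1D11E (outside the BMP)

        // Standard UTF-8: the supplementary code point takes one 4-byte sequence.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8: each surrogate is encoded separately as 3 bytes.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);
        byte[] modified = buf.toByteArray();          // 2-byte length prefix + payload

        System.out.println(standard.length);          // 5 (1 + 4)
        System.out.println(modified.length - 2);      // 7 (1 + 3 + 3)
    }
}
```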

In the class file format they wanted to use UTF-8 to save space. The design of the class file format (including the JVM instruction set) was heavily geared towards compactness.

In the runtime they wanted to use UCS-2 because it was felt that saving space was less important than being able to avoid the need to deal with variable-width characters. Unfortunately, this kind of backfired now that it's UTF-16, because a codepoint can now take multiple "chars", and worse, the "char" datatype is now sort of misnamed (it no longer corresponds to a character, in general, but instead corresponds to a UTF-16 code-unit).
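
A small sketch of what that means in practice (the class name is just illustrative): iterating by `char` walks UTF-16 code units and splits a supplementary character in half, while iterating by code point keeps it whole.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "A\uD834\uDD1EB";  // A, U+1D11E (a surrogate pair), B

        System.out.println(s.length());                       // 4 chars
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points

        for (int i = 0; i < s.length(); i++) {
            System.out.printf("char %d: %04X%n", i, (int) s.charAt(i));  // 0041 D834 DD1E 0042
        }
        s.codePoints().forEach(cp -> System.out.printf("code point: U+%X%n", cp));  // 41, 1D11E, 42
    }
}
```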

Laurence Gonsalves
C and C++ on Windows have an even worse naming convention: Not only does `char` not correspond to a character, but `wchar_t` doesn't either!
dan04
A character is an abstract concept that cannot be represented using an integral data type. A `wchar_t` on Windows or a `char` in Java is a UTF-16 code unit.
Philipp
@Philipp: Why do you say a character cannot be represented with an integral data type? I assume you're alluding to the existence of composite characters. While that would make it harder to represent characters with an integral type, so long as the set of characters is countable, you can represent characters with an integral data type. But my point was just that it would be much nicer if char/Character actually corresponded to "Unicode code point" rather than the even more confusing/arbitrary "UTF-16 code unit".
Laurence Gonsalves