A Java char is 2 bytes (so it can hold at most 65,536 distinct values), but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?
Does this boil down to what character encoding you are using?
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. Characters with code points above 0xFFFF are encoded as two chars (a surrogate pair).
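For instance (a minimal sketch; U+1F600 is just an arbitrary supplementary character chosen for illustration):

```java
// U+1F600 is above 0xFFFF, so Character.toChars() yields a surrogate pair
char[] units = Character.toChars(0x1F600);
System.out.println(units.length);                        // 2
System.out.println(Character.isHighSurrogate(units[0])); // true
System.out.println(Character.isLowSurrogate(units[1]));  // true
```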
See http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
From the OpenJDK7 documentation for String:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
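A small sketch of what that means in practice (the string below contains U+1D11E, MUSICAL SYMBOL G CLEF, picked here only because it lies outside the BMP):

```java
String s = "a\uD834\uDD1Eb";  // 'a', U+1D11E as a surrogate pair, 'b'
System.out.println(s.length());                       // 4 -- char code units
System.out.println(s.codePointCount(0, s.length()));  // 3 -- actual characters
System.out.println(Integer.toHexString(s.charAt(1))); // d834 -- half a pair, not a character
```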
Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
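For example, a loop that walks a string code point by code point instead of char by char (a sketch using only standard java.lang methods):

```java
String s = "a\uD83D\uDE00b";            // contains one supplementary character
for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);          // the full code point, even across a surrogate pair
    System.out.printf("U+%04X%n", cp);  // U+0061, U+1F600, U+0062
    i += Character.charCount(cp);       // advance by 1 or 2 char positions
}
```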
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.
In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
- char is a UTF-16 code unit, not a code point
- new low-level APIs use an int to represent a Unicode code point (see the sketch below)
- high-level APIs have been updated to understand surrogate pairs
- a preference towards char-sequence APIs instead of char-based methods
Since Java doesn't have 32-bit chars, I'll let you judge if we can call this good Unicode support.
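As a sketch of those int-based low-level APIs (U+20000 is the first CJK ideograph outside the BMP):

```java
int cp = 0x20000;                            // a code point outside the BMP
System.out.println(Character.isLetter(cp));  // true -- the int overload sees the whole code point
System.out.println(Character.isLetter("\uD840\uDC00".charAt(0))); // false -- a lone surrogate

StringBuilder sb = new StringBuilder().appendCodePoint(cp);
System.out.println(sb.length());             // 2 -- stored as a surrogate pair
```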
To add to the other answers, some implications:
A Java char is always 16 bits.
A Unicode character, encoded as UTF-16, is "almost always" (but not always) 16 bits: there are more than 64K Unicode characters. So a Java char is NOT a Unicode character (though it "almost always" is).
"Almost always", above, means the first 64K characters, in the range 0x0000 to 0xFFFF (the BMP), which occupy exactly 16 bits in the UTF-16 encoding.
A "rare" (non-BMP) Unicode character is represented as two Java chars (a surrogate pair). This also applies to the literal representation inside a string: for example, the character U+20000 is written as "\uD840\uDC00".
Corollary: String.length() returns the number of Java chars, not of Unicode characters. A string consisting of just one "rare" Unicode character (e.g. U+20000) returns length() == 2. The same consideration applies to any method that deals with char sequences.
Java has little built-in intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch); those are the methods that are really fully Unicode-aware.
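Putting those points together in a minimal sketch (reusing the U+20000 example from above):

```java
String rare = "\uD840\uDC00";                // U+20000: one Unicode character, two Java chars
System.out.println(rare.length());           // 2, not 1
System.out.println(rare.codePointCount(0, rare.length()));   // 1
System.out.println(Character.isLetter(rare.codePointAt(0))); // true -- code-point aware
System.out.println(rare.substring(0, 1));    // a broken, lone high surrogate
```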
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)