Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking it might be for historical reasons, or perhaps for performance, but I couldn't find any information.

Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same?

EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

+6  A: 

For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.

So you avoid all the complexity of variable-length encodings like UTF-8.
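
Not part of the original answer: a minimal Java sketch of how you might verify that assumption for a given string, assuming Java 7+ for `Character.isSurrogate`. If a string passes this check, the fixed-length shortcut is safe for it.

```java
public class BmpCheck {
    // Returns true if every char in s is a BMP code unit, i.e. the
    // "UTF-16 as fixed-length" shortcut is safe for this particular string.
    static boolean isBmpOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;   // found a surrogate: a supplementary-plane character
            }
        }
        return true;
    }
}
```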

Joe
+1 In fact I think Unicode version 1 only had the basic plane, which is why a number of platforms assumed 16 bits would be the right size for a simple character data type.
Daniel Earwicker
"I think Unicode version 1 only had the basic" - yes that's true, more details here: http://en.wikipedia.org/wiki/UTF-16/UCS-2
Joe
That's like saying "a lot of programs only care about ASCII, so can treat UTF-8 as a fixed-length encoding."
dan04
+15  A: 

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.

Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in much the same way that combining characters do, so UTF-16 can usually be processed as if it were a fixed-size encoding.
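
To see those size differences concretely, here is a small Java sketch (my addition, not from the answer; it assumes Java 7+ for `StandardCharsets`, and the sample strings are arbitrary):

```java
import java.nio.charset.StandardCharsets;

// Compare encoded sizes for a CJK string vs. a Latin string.
// The exact byte counts depend on the sample text chosen here.
public class EncodedSize {
    public static void main(String[] args) {
        String cjk = "\u65e5\u672c\u8a9e";      // "日本語", 3 BMP characters
        String latin = "hello";

        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 9  (3 bytes each)
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16LE).length);   // 6  (2 bytes each)
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);    // 5  (1 byte each)
        System.out.println(latin.getBytes(StandardCharsets.UTF_16LE).length); // 10 (2 bytes each)
    }
}
```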

Dean Harding
+1 For correctly characterizing the number of bytes per character in UTF-16 and UTF-8.
Joren
I thought UTF-8 can use up to 4 bytes per character, which pretty much makes UTF-16 and UTF-32 useless.
Sir Psycho
@Sir Psycho: UTF-8, UTF-16, and UTF-32 are all able to encode all the characters of Unicode. codeka was talking about how many bytes result from encoding a "typical" Unicode character using UTF-8 versus UTF-16.
GregS
@GregS: that too :)
Dean Harding
Oops. My mistake :)
Sir Psycho
+3  A: 

It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.

Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. That kind of fixed-offset indexing would break down completely with UTF-8, where code points are encoded with sequences of varying byte lengths.
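
As a concrete illustration (a sketch added here, not from the original answer), in Java a supplementary-plane character such as U+1D11E is stored as a surrogate pair, and `length()`/`charAt()` see the two halves separately:

```java
public class SurrogateIndexing {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so it is stored
        // as a surrogate pair -- two char values -- in a Java String.
        String s = "clef:\uD834\uDD1E";

        System.out.println(s.length());                      // 7 -- counts UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 6 -- counts Unicode code points
        System.out.println((int) s.charAt(5));               // 55348 (0xD834) -- just the high surrogate
    }
}
```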

corvuscorax
Could you elaborate about "Java and C# do not take surrogate pairs into account when indexing strings"?
Oak
If you have a string in C# (or Java) that contains surrogate pairs (surrogate pairs are used to encode characters outside the normal two-byte range), each pair will count as two 16-bit characters rather than as one Unicode code point, at least for indexing and length-reporting purposes.
corvuscorax
+3  A: 

I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.

I imagine there are two main reasons why Windows NT uses UTF-16 internally:

  • For memory usage: UTF-32 wastes a lot of space.
  • For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16, a character is either a Basic Multilingual Plane character (2 bytes) or a surrogate pair (4 bytes), whereas UTF-8 characters can be anywhere from 1 to 4 bytes. (A rough sketch of that two-case decode follows this list.)
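
As promised above, a rough Java sketch (my addition, not part of this answer) of that two-case UTF-16 decode:

```java
public class Utf16Decode {
    // Why UTF-16 decoding stays simple: each code point is either one
    // code unit (BMP) or exactly two (a surrogate pair).
    static int[] toCodePoints(char[] units) {
        int[] out = new int[units.length];   // upper bound; the result may be shorter
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (Character.isHighSurrogate(c) && i + 1 < units.length
                    && Character.isLowSurrogate(units[i + 1])) {
                out[n++] = Character.toCodePoint(c, units[i + 1]);  // 4-byte case
                i++;                                                // consume the low surrogate
            } else {
                out[n++] = c;                                       // 2-byte (BMP) case
            }
        }
        return java.util.Arrays.copyOf(out, n);
    }
}
```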

Contrary to what other people have answered, you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over the actual characters in a string, you have to use Unicode-friendly iteration functions; for example, in C# you need to use StringInfo.GetTextElementEnumerator().
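
The answer names the C# API; the closest standard-library analogue in Java (my substitution, offered as a sketch rather than anything from the original answer) is `java.text.BreakIterator`'s character instance, which likewise keeps surrogate pairs and combining sequences together:

```java
import java.text.BreakIterator;

// Iterate over user-perceived characters (grapheme boundaries) so that
// surrogate pairs and combining sequences are not split apart.
public class Graphemes {
    public static void main(String[] args) {
        String s = "e\u0301x";   // 'e' + COMBINING ACUTE ACCENT + 'x' (decomposed "éx")
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            System.out.println(s.substring(start, end));   // prints "é" then "x"
        }
    }
}
```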

For further information, this page on the wiki is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Andrew Russell
Oh, and don't forget combining characters! (Which `GetTextElementEnumerator` will also handle.)
Andrew Russell
"...you cannot treat UTF-16 as UCS-2" - but many successful real-world applications do, and get away with it because they are only using BMP characters.
Joe
Very useful link, thanks!
Oak
@Joe For simply pushing text about, just pretending each character is 2 bytes will "work" (just like you can often pretend UTF-8 is ASCII and hope for the best). In fact, that's what you're usually doing when you use `string` in C#. But what happens if I paste or load some text into your application in, say, a decomposed format? Anything that does processing on a character-by-character basis needs to do so with an actual understanding of how that text is encoded. (Fortunately most applications work on strings, not characters.)
Andrew Russell
The bigger reason is that the original Windows NT was released about the same time as Unicode 1.1, before there were supplementary planes.
dan04
+2  A: 

UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.

UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.

richj
+4  A: 

@Oak: this is too long for a comment...

I don't know about C# (and I would be really surprised: it would mean they just copied Java too much), but for Java it's simple: Java was conceived before Unicode 3.1 came out.

Hence there were fewer than 65,537 code points, so every Unicode code point still fit in 16 bits, and so the Java char was born.

Of course this led to crazy issues that are still affecting Java programmers (like me) today: you have a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method (added in Java 5) codePointAt which takes an argument that is not the number of code points you want to skip! (You have to supply to codePointAt the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.)
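
A short Java sketch of exactly that trap (my addition; U+1D11E is just an arbitrary supplementary-plane character):

```java
public class CodePointIndexing {
    public static void main(String[] args) {
        // 'a', U+1D11E (stored as a surrogate pair), 'b' -- 3 code points, 4 chars.
        String s = "a\uD834\uDD1Eb";

        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e -- the argument is a char index
        System.out.println(Integer.toHexString(s.codePointAt(2))); // dd1e  -- lands on an unpaired low surrogate

        // To get at the i-th *code point*, convert the index first:
        int i = 2;                                            // the 3rd code point, 'b'
        int charIndex = s.offsetByCodePoints(0, i);           // -> 3
        System.out.println((char) s.codePointAt(charIndex));  // b
    }
}
```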

So, yup, this is definitely wild and confuses most Java programmers (most aren't even aware of these issues), and, yup, it is for historical reasons. At least, that was the excuse that came up when people got mad about this issue: Unicode 3.1 wasn't out yet.

:)

NoozNooz42