Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I am pretty sure that all three of these encoding systems (UTF-8, UTF-16, and UTF-32) support all of the Unicode characters, but I need to confirm it before I make that claim in a presentation I am giving about it.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

+4  A: 

They don't differ in this regard. All allow for a maximum of 4 bytes per character and so all can support the same number of characters.

This table in the unicode.org FAQ may be useful in highlighting the differences that do exist.

Dave Webb
UTF-16, as far as I know with its current surrogate pair implementation, supports only up to U+10FFFF. Now, that's the upper limit for Unicode characters, but it's a far cry from the full 32 bits. :-) UTF-32 and UTF-8 can support up to 32 bits.
Chris Jester-Young
However, to actually answer the question, there is no difference between the 3 encodings in their ability to represent the entire set of Unicode code points. :-)
Chris Jester-Young
Fair point - I'll remove the "32-bits" from the answer as it's misleading.
Dave Webb
Chris, UTF-8 can support 31 bits worth of encodings (or could before it was redefined to support only up to 0x10FFFF).
Derek Park
This answer is rather misleading. The "maximum of 4 bytes per character" has absolutely nothing to do with it. UTF-16 simply has the smallest "theoretical" capacity, and so UTF-8 and UTF-32 have been artificially restricted to what UTF-16 can encode.
Artelius
+3  A: 

I personally always check Joel's post about unicode, encodings and character sets when in doubt.

korchev
+4  A: 

UTF-8, UTF-16, and UTF-32 all support the full set of Unicode code points. There are no characters that are supported by one but not another.
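A quick way to convince yourself of this is to round-trip a few code points through each encoding. This is only a minimal sketch: the sample list and the helper name `round_trips` are my own choices for illustration, and it relies on Python's built-in `utf-8`, `utf-16-le` and `utf-32-le` codecs rather than proving anything about the standards themselves.

```python
# Round-trip a handful of code points, including both ends of the
# Unicode range, through each of the three encodings.
samples = [0x0000, 0x0041, 0x00E9, 0x20AC, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]


def round_trips(code_point, encoding):
    """Return True if encoding then decoding gives back the same character."""
    ch = chr(code_point)
    return ch.encode(encoding).decode(encoding) == ch


all_ok = all(round_trips(cp, enc) for cp in samples for enc in encodings)
print(all_ok)
```

Surrogate values (U+D800 to U+DFFF) are deliberately left out of the samples, since they are not characters and Python refuses to encode them.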

As for the bonus question "Do these encodings differ in the number of characters they can be extended to support?" Yes and no. The way UTF-8 and UTF-16 are encoded limits the total number of code points they can support to less than 2^32. However, the Unicode Consortium will not add code points to UTF-32 that cannot be represented in UTF-8 or UTF-16. Doing so would violate the spirit of the encoding standards, and make it impossible to guarantee a one-to-one mapping from UTF-32 to UTF-8 (or UTF-16).

Derek Park
AFAIK, there are ways to extend UTF-8 to support 32 bits fully. With UTF-16, the limit of U+10FFFF is hard-wired and cannot be overcome without completely changing the way surrogate pairs work.
Chris Jester-Young
It could originally cover 31 bits. That is the maximum that the encoding scheme can handle. (It has since been revised to cover only the Unicode code points, far less than 31 bits.)
Derek Park
+15  A: 

No, they're simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses anywhere from one to four bytes per character depending on what character you're encoding. Characters within the ASCII range take only one byte while very unusual characters take four.

UTF-32 uses four bytes per character regardless of what character it is, so it will always use more space than UTF-8 to encode the same string. The only advantage is that you can calculate the number of characters in a UTF-32 string simply by dividing its length in bytes by four.

UTF-16 uses two bytes for most characters, four bytes for unusual ones.
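To make the sizes concrete, here is a small sketch that measures the encoded length of four sample characters. The helper `encoded_sizes` is a name made up for this example, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(text):
    """Byte length of text in each of the three encodings (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# U+0041 'A', U+00E9 e-acute, U+20AC euro sign, U+1F600 an emoji
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For the last sample, a character outside the Basic Multilingual Plane, all three encodings need four bytes.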

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

skoob
"so it will always use more space than UTF-8" -- you mean more or equal space.
chazomaticus
And space gets cheaper every day, so the extra space UTF-32 uses is not important. Also, to find the nth character in UTF-8 you need O(n), but in UTF-32 you need only O(1), which is much faster!
Joschua
+3  A: 

All of the UTF-8/16/32 encodings can map all Unicode characters. See the wiki for a comparison.

This IBM article is very helpful, and indicates that if you have the choice, it's better to choose UTF-8. The main reasons are wide tool support and the fact that UTF-8 can usually pass through systems that are unaware of Unicode.

From "What the specs say" in the IBM article:

Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

Robert Paulson
+3  A: 

As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 (sometimes mistakenly referred to as UCS-16) variant can't, and this is the one that you find e.g. in Windows XP/Vista.

See Wikipedia for more information.

Edit: I am wrong about Windows; NT was the only one to support UCS-2. However, many Windows applications will assume a single word per code point as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks JasonTrue)

Mark Ransom
Actually Windows XP/Vista support UTF-16, but many apps assume Unicode data is UCS-2 in cases when they should be checking for surrogate pairs. This is usually not a problem for simple cases, but a mess for character iteration, caret placement, or truncating strings.
JasonTrue
+9  A: 

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.

While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.

The software that reads a UTF-8 stream just gets a sequence of bytes - how is it supposed to decide whether the next 4 bytes is a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.

You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the \ character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, like \n or \xFF. Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.

Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.

Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it isn't the beginning of a longer character).

For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.
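Here is a rough sketch of how a decoder can find character boundaries just by looking at the first byte of each sequence. The function names `sequence_length` and `split_characters` are invented for this illustration, and the code assumes its input is already valid UTF-8: it does not check continuation bytes or reject overlong forms, which a real decoder must do.

```python
def sequence_length(lead_byte):
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: 1-byte character (ASCII)
        return 1
    if 0xC0 <= lead_byte < 0xE0:    # 110xxxxx: starts a 2-byte character
        return 2
    if 0xE0 <= lead_byte < 0xF0:    # 1110xxxx: starts a 3-byte character
        return 3
    if 0xF0 <= lead_byte < 0xF8:    # 11110xxx: starts a 4-byte character
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a character")


def split_characters(data):
    """Split a UTF-8 byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


stream = "A\u00e9\u20ac\U0001F600".encode("utf-8")
print(split_characters(stream))
```

Because byte 65 falls in the first branch, it is always a complete character on its own and can never be mistaken for the start of something longer.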

But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's strstr() function always works as expected if its inputs are valid UTF-8 strings.
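The same property can be checked from Python, where `bytes.find` plays the role of a byte-oriented search like strstr(). This is a sketch with an arbitrary sample string of my own choosing, not a proof:

```python
text = "na\u00efve caf\u00e9 A"
data = text.encode("utf-8")

# Every byte that belongs to a non-ASCII character has its high bit set,
# so none of them can be confused with an ASCII byte such as 65 ('A').
multi_byte = [b for ch in text if ord(ch) > 127 for b in ch.encode("utf-8")]
print(all(b >= 0x80 for b in multi_byte))

# A plain byte search therefore only matches a real 'A'.
needle = "A".encode("utf-8")
byte_pos = data.find(needle)
print(data[:byte_pos].decode("utf-8"))
```

Note that the position returned is a byte offset, not a character index; the two differ as soon as a multi-byte character appears before the match.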

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and every 4-byte character consists of a 2-byte value in the range 0xD800-0xDBFF followed by a 2-byte value in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
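The arithmetic behind a 4-byte UTF-16 character can be sketched as follows. The function name `to_surrogates` is made up for this example; the offsets follow the ranges given above.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need two values")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
```

Each half carries 10 bits, which is why this scheme tops out at 2^20 extra characters and a highest code point of U+10FFFF.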

UTF-32

UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.

Capacity comparison:

  • UTF-8: 2,097,152 (actually 2,166,912 but due to design details some of them map to the same thing)
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)
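Most of these figures can be reproduced with a few lines of arithmetic. This sketch covers the UTF-16 and UTF-32 numbers and the 21 payload bits of a 4-byte UTF-8 sequence; the variable names are my own.

```python
surrogates = 0xDFFF - 0xD800 + 1         # 2-byte values reserved for pairs
high_surrogates = 0xDBFF - 0xD800 + 1    # possible first halves of a pair
low_surrogates = 0xDFFF - 0xDC00 + 1     # possible second halves of a pair

# 2-byte characters that are not surrogates, plus every possible pair.
utf16_capacity = (2**16 - surrogates) + high_surrogates * low_surrogates

utf8_four_byte_payload = 2**21           # 3 + 6 + 6 + 6 payload bits
utf32_raw = 2**32                        # every 4-byte pattern
unicode_range = 0x10FFFF + 1             # values 0 through 0x10FFFF

print(f"{utf16_capacity:,}")
print(f"{utf8_four_byte_payload:,}")
print(f"{utf32_raw:,}")
print(f"{unicode_range:,}")
```

The difference between 1,114,112 and 1,112,064 is exactly the 2,048 surrogate values.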

The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.

The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is how much is needed to cover all of what UTF-16 does.
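The capacities quoted here come from counting payload bits. In a sequence of n bytes (n of 2 or more), the first byte spends n bits on a run of ones and, where there is room, one more on a terminating zero; each of the remaining bytes carries 6 bits. The sketch below applies that rule; `payload_bits` is a hypothetical helper, and the 7- and 8-byte forms are the hypothetical extension discussed above, not part of any UTF-8 specification.

```python
def payload_bits(n):
    """Payload bits in an n-byte sequence of the (extended) UTF-8 scheme."""
    if n == 1:
        return 7                      # 0xxxxxxx
    lead_bits = max(0, 7 - n)         # bits left in the first byte
    return lead_bits + 6 * (n - 1)    # plus 6 bits per continuation byte


for n in range(1, 9):
    print(n, payload_bits(n))
```

This gives 21 bits for 4 bytes, 31 bits for the original 6-byte limit, and 42 bits for the hypothetical 8-byte form.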

There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).

Artelius
I had to read this response three times before it made sense and almost marked it down because it appears to be flat-out wrong at first glance. I now think it basically checks out, but it could use some reworking.
bendin
Thanks for the comment. It's a tricky subject to explain but I'll try my best.
Artelius