UTF-8
UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.
While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.
The software that reads a UTF-8 stream just gets a sequence of bytes - how is it supposed to decide whether the next 4 bytes form a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically, this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.
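To make that concrete, here is a minimal sketch in C of the "how long is this character?" decision. The lead-byte bit patterns aren't spelled out above; they come straight from the UTF-8 definition:

```
#include <stddef.h>

/* How many bytes does the character starting with `first` occupy?
   Returns 0 if this byte can never start a character. */
size_t utf8_sequence_length(unsigned char first)
{
    if (first < 0x80)           return 1; /* 0xxxxxxx: a 1-byte (ASCII) character */
    if ((first & 0xE0) == 0xC0) return 2; /* 110xxxxx: starts a 2-byte character  */
    if ((first & 0xF0) == 0xE0) return 3; /* 1110xxxx: starts a 3-byte character  */
    if ((first & 0xF8) == 0xF0) return 4; /* 11110xxx: starts a 4-byte character  */
    return 0; /* 10xxxxxx continuation bytes can't start a character */
}
```

The bytes for which this returns 0 are exactly the "invalid on their own" values just described: they only make sense as the tail end of a longer sequence.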
You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the `\` character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a `\` is found in the source, it is assumed to be part of a longer sequence, like `\n` or `\xFF`. Note that `\x` is an invalid 2-character sequence, and `\xF` is an invalid 3-character sequence, but `\xFF` is a valid 4-character sequence.
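As a purely illustrative sketch of that rule (a toy, not any particular language's real parser), an escape expander follows the same "keep reading until the sequence is valid" logic:

```
#include <stdio.h>

static int hex_digit(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Expand only the \n and \xHH escapes in `src`, printing the result. */
void expand_escapes(const char *src)
{
    while (*src) {
        if (*src != '\\') {                      /* ordinary character */
            putchar(*src++);
        } else if (src[1] == 'n') {              /* "\n" is complete at 2 chars */
            putchar('\n');
            src += 2;
        } else if (src[1] == 'x' && hex_digit(src[2]) >= 0
                                 && hex_digit(src[3]) >= 0) {
            putchar(hex_digit(src[2]) * 16 + hex_digit(src[3]));
            src += 4;                            /* "\xFF" is complete at 4 chars */
        } else {
            return;                              /* incomplete or unknown escape */
        }
    }
}
```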
Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.
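Here is what that compromise looks like from the encoding side - a sketch of a UTF-8 encoder, where the cheap 1- and 2-byte forms are reserved for the smaller (and, for ASCII, more common) code points. The thresholds are from the UTF-8 definition; the 0x10FFFF and surrogate restrictions discussed later are deliberately ignored here.

```
#include <stddef.h>
#include <stdint.h>

/* Write the UTF-8 encoding of code point `cp` into `out` (room for 4 bytes).
   Returns the number of bytes used, or 0 if cp doesn't fit in 4 bytes. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: up to 11 bits of payload */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: up to 16 bits of payload */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x200000) {                  /* 4 bytes: up to 21 bits of payload */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```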
Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it isn't the beginning of a longer character).
For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.
But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.
Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2-, 3- and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's `strstr()` function always works as expected if its inputs are valid UTF-8 strings.
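For example (the strings here are just made up for illustration):

```
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "ï" is the 2-byte UTF-8 sequence 0xC3 0xAF; both bytes are >= 128,
       so they can never be confused with an ASCII character. */
    const char *haystack = "a na\xC3\xAFve question";
    const char *needle   = "na\xC3\xAFve";

    const char *hit = strstr(haystack, needle); /* plain byte-oriented search */
    if (hit)
        printf("match at byte offset %d\n", (int)(hit - haystack));
    return 0;
}
```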
UTF-16
UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and every 4-byte character consists of a 2-byte unit in the range 0xD800-0xDBFF (a "high surrogate") followed by a 2-byte unit in the range 0xDC00-0xDFFF (a "low surrogate"). For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
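A sketch of how a decoder combines such a pair, assuming the two 2-byte units have already been read into integers (the formula is the standard UTF-16 one):

```
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into the code point it encodes.
   `hi` must lie in 0xD800-0xDBFF and `lo` in 0xDC00-0xDFFF;
   any other 2-byte unit simply stands for itself. */
uint32_t utf16_combine(uint16_t hi, uint16_t lo)
{
    /* Each unit contributes 10 bits, and the result is offset by 0x10000
       so that 4-byte characters start where the 2-byte range ends. */
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```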
UTF-32
UTF-32 is a fixed-length code, with each character being 4 bytes long. While 4 bytes could in principle encode 2^32 different characters, only code points between 0 and 0x10FFFF are allowed in this scheme.
Capacity comparison:
- UTF-8: 2,097,152 (there are actually 2,164,864 structurally valid sequences, but overlong encodings mean some of them map to the same character)
- UTF-16: 1,112,064
- UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)
The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.
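Those capacity figures are easy to re-derive (nothing new here, just the arithmetic behind the list above):

```
#include <stdio.h>

int main(void)
{
    /* UTF-16: all 2-byte units except the 2,048 reserved surrogates,
       plus one character for every high/low surrogate pair. */
    long utf16 = (65536L - 2048) + 1024L * 1024L;
    printf("UTF-16: %ld\n", utf16);              /* 1,112,064 */

    /* Unicode code points as limited by UTF-16: U+0000 through U+10FFFF. */
    printf("Code points: %ld\n", 0x10FFFFL + 1); /* 1,114,112 */
    return 0;
}
```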
The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is enough to cover everything UTF-16 can encode.
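Those powers of two just count payload bits: the lead byte contributes whatever bits are left after its length marker, and every continuation byte contributes 6 (the 8-byte figure assumes an all-ones lead byte carrying no payload, which is my reading of that hypothetical extension):

```
#include <stdio.h>

int main(void)
{
    printf("4-byte UTF-8: %d payload bits\n", 3 + 3 * 6); /* 21 -> 2,097,152 */
    printf("6-byte UTF-8: %d payload bits\n", 1 + 5 * 6); /* 31 -> 2^31      */
    printf("8-byte UTF-8: %d payload bits\n", 0 + 7 * 6); /* 42 -> 2^42      */
    return 0;
}
```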
There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).