What are the differences between UTF8, UTF16, and UTF32? I understand that all three store Unicode and that each encodes the characters differently, but is there an advantage to choosing one over the other?
In short:
- UTF8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
- UTF16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
- UTF32: Fixed-width encoding. All code points take 4 bytes. An enormous memory hog, but fast to operate on. Rarely used.
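If you want to sanity-check the byte counts above, a small Python 3 sketch like this prints the size of one sample character from each range in all three encodings (the sample characters and the little-endian, BOM-less codecs are just illustrative choices):

```python
# Byte counts for one sample character from each code-point range,
# in UTF-8, UTF-16 and UTF-32 (little-endian variants, so no BOM).
samples = ["A", "é", "中", "😀"]   # U+0041, U+00E9, U+4E2D, U+1F600
for ch in samples:
    print(f"U+{ord(ch):04X}: "
          f"UTF-8 = {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes, "
          f"UTF-32 = {len(ch.encode('utf-32-le'))} bytes")
```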
Perhaps this Unicode FAQ helps. Basically, the code units in UTF8/16/32 are 8/16/32 bits wide, so it's mostly a matter of space and convenience, since all three can represent every Unicode character.
- UTF8 is variable, 1 to 4 bytes per character.
- UTF16 is variable, 2 or 4 bytes per character.
- UTF32 is fixed, 4 bytes per character.
UTF-8 has an advantage when ASCII characters make up most of the text: in that case most characters occupy only one byte each. It also helps that a UTF-8 file containing only ASCII characters is byte-for-byte identical to an ASCII file.
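A minimal sketch of that ASCII compatibility, assuming Python 3 and an arbitrary ASCII-only string:

```python
# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# so an ASCII-only file is already a valid UTF-8 file.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))   # b'Hello, world!' -- one byte per character
```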
UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character for most text. UTF-8 needs 3 or more bytes for the higher code points, where UTF-16 usually stays at just 2.
UTF-32 covers all possible characters in 4 bytes each, which makes it pretty bloated. I can't think of any advantage to using it.
As mentioned, the difference is primarily the size of the underlying code units, which in each case get larger so that a single unit can represent more characters directly.
However, fonts, encodings and related matters are wickedly complicated (unnecessarily?), so here is a long read to fill in more detail:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).
Paul.
Unicode defines a huge character set, assigning one unique integer value to every graphical symbol. UTF8/16/32 are simply different ways to encode this.
In brief, UTF32 uses 32-bit values for each character. That allows it to use a fixed-width code for every character.
UTF16 uses 16-bit values by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.
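Those pairs are called surrogate pairs. A quick way to see one, assuming Python 3 (the emoji is just an arbitrary code point above U+FFFF):

```python
import struct

# U+1F600 does not fit in one 16-bit unit, so UTF-16 encodes it as a
# surrogate pair: a high surrogate followed by a low surrogate.
ch = "😀"
data = ch.encode("utf-16-le")          # little-endian, no BOM
units = struct.unpack("<2H", data)     # two 16-bit code units
print([hex(u) for u in units])         # ['0xd83d', '0xde00']
```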
And UTF8 uses 8-bit values by default, which means that the first 128 values are fixed-width single-byte characters (the most significant bit of such a byte is 0, leaving 7 bits for the actual character value; a set high bit marks a byte that belongs to a multi-byte sequence). All other characters are encoded as sequences of 2 to 4 bytes.
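You can see that bit layout by dumping the UTF-8 bytes in binary; this Python 3 sketch is just one way to do it:

```python
# The high bits of each UTF-8 byte reveal its role: '0xxxxxxx' is a
# single-byte ASCII character, '110xxxxx'/'1110xxxx'/'11110xxx' start a
# 2/3/4-byte sequence, and '10xxxxxx' is a continuation byte.
for ch in ("A", "é", "€", "😀"):
    print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])
```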
And that leads us to the advantages. Any ASCII character is directly compatible with UTF8, so for upgrading legacy apps, UTF8 is a common and obvious choice. It may also use the least memory, assuming that your app uses mostly ASCII characters (since these are encoded as a single byte each). On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.
UTF32 is the opposite: it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from its length in bytes. You can't do that with UTF8.
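For example (Python 3, with an arbitrary sample string):

```python
# With UTF-32 the code-point count is simply byte length / 4; the UTF-8
# byte length tells you nothing directly about the number of code points.
s = "naïve 😀"
print(len(s.encode("utf-32-le")) // 4)   # 7 code points
print(len(s.encode("utf-8")))            # 11 bytes, not 7
```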
UTF16 is a compromise. It lets most characters fit into a fixed-width 16-bit value, so as long as you don't have characters outside the Basic Multilingual Plane (emoji, musical notation symbols, some rare ideographs and so on), you can assume that each character is 16 bits wide. It uses less memory than UTF32, and if you have a lot of non-ASCII text, it may also use less memory than UTF8.
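A rough size comparison along those lines, assuming Python 3 and an arbitrary Japanese sample string:

```python
# Most CJK characters sit in the U+0800..U+FFFF range: 3 bytes each in
# UTF-8 but only 2 bytes each in UTF-16.
jp = "こんにちは世界"                 # 7 code points, all in the BMP
print(len(jp.encode("utf-8")))       # 21 bytes
print(len(jp.encode("utf-16-le")))   # 14 bytes
```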
Finally, it's often helpful to just go with what the platform supports. Windows uses UTF16 internally, so on Windows, that is the obvious choice.
Linux Unicode support varies a bit, but they generally use UTF8 for everything that is Unicode-compliant.
Depending on your dev environment, you may not even have a choice of which encoding your string data type uses internally.
But for storing and exchanging data I would always use UTF8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for least I/O is the way to go on modern machines.
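A minimal sketch of being explicit about that when writing and reading a file, assuming Python 3 (the file name and contents are placeholders):

```python
# Write and read back a file, stating the encoding explicitly instead of
# relying on the platform default.
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("café, 10 €\n")

with open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())
```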
I just got done reading Joel's article on Unicode from several years back. I think much of it still applies.
In UTF32 all characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra 3 bytes.
In UTF8 characters have variable length: ASCII characters are coded in 1 byte (8 bits), most western special characters are coded in either 2 or 3 bytes (for example, € is 3 bytes), and more exotic characters can take up to 4 bytes. The clear disadvantage is that you cannot calculate a string's length a priori. But it takes a lot fewer bytes to encode Latin (English) alphabet text compared to UTF32.
UTF16 is also variable length. Characters are coded in either 2 or 4 bytes. I really don't see the point. It has the disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF8.
Of those three, UTF8 is clearly the most widespread.
Can't we all just use UTF-8? - sort of a plea for the future, I guess. What I did not see mentioned is that the web is turning to UTF-8, and for coding, where you work with text files, UTF-8 is the obvious choice.
I'd like to look back in 10 years with UTF16 and 32 as nothing but curiosities. Let all those code pages fade away. Goodbye to the 'Text Encoding' menu on your browser.