What are the differences between UTF8, UTF16, and UTF32? I understand that all three store Unicode and that each encodes the characters differently, but is there an advantage to choosing one over the other?
In short:
- UTF8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
- UTF16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
- UTF32: Fixed-width encoding. All code points take 4 bytes. An enormous memory hog, but fast to operate on. Rarely used.
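If you want to sanity-check the byte counts above, a small Python 3 sketch like this prints the size of one sample character from each range in all three encodings (the sample characters and the little-endian, BOM-less codecs are just illustrative choices):

```python
# Byte counts for one sample character from each code-point range,
# in UTF-8, UTF-16 and UTF-32 (little-endian variants, so no BOM).
samples = ["A", "é", "中", "😀"]   # U+0041, U+00E9, U+4E2D, U+1F600
for ch in samples:
    print(f"U+{ord(ch):04X}: "
          f"UTF-8 = {len(ch.encode('utf-8'))} bytes, "
          f"UTF-16 = {len(ch.encode('utf-16-le'))} bytes, "
          f"UTF-32 = {len(ch.encode('utf-32-le'))} bytes")
```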
Perhaps this Unicode FAQ helps. Basically, the code units in UTF8/16/32 are 8/16/32 bits wide, so it's mostly a matter of space and convenience, since all three can represent every Unicode character.
- UTF8 is variable, 1 to 4 bytes per character.
- UTF16 is variable, 2 or 4 bytes per character.
- UTF32 is fixed, 4 bytes per character.
UTF-8 has an advantage when ASCII characters make up most of the text: in that case most characters occupy only one byte each. It also helps that a UTF-8 file containing only ASCII characters is byte-for-byte identical to an ASCII file.
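A minimal sketch of that ASCII compatibility, assuming Python 3 and an arbitrary ASCII-only string:

```python
# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# so an ASCII-only file is already a valid UTF-8 file.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))   # b'Hello, world!' -- one byte per character
```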
UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character for most text. UTF-8 needs 3 or more bytes for the higher code points, where UTF-16 usually stays at just 2.
UTF-32 covers all possible characters in 4 bytes each, which makes it pretty bloated. I can't think of any advantage to using it.
As mentioned, the difference is primarily the size of the underlying code units, which in each case get larger so that a single unit can represent more characters directly.
However, fonts, encodings and related matters are wickedly complicated (unnecessarily?), so here is a long read to fill in more detail:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).
Paul.
Unicode defines a huge character set, assigning one unique integer value to every graphical symbol. UTF8/16/32 are simply different ways to encode this.
In brief, UTF32 uses 32-bit values for each character. That allows it to use a fixed-width code for every character.
UTF16 uses 16-bit values by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.
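Those pairs are called surrogate pairs. A quick way to see one, assuming Python 3 (the emoji is just an arbitrary code point above U+FFFF):

```python
import struct

# U+1F600 does not fit in one 16-bit unit, so UTF-16 encodes it as a
# surrogate pair: a high surrogate followed by a low surrogate.
ch = "😀"
data = ch.encode("utf-16-le")          # little-endian, no BOM
units = struct.unpack("<2H", data)     # two 16-bit code units
print([hex(u) for u in units])         # ['0xd83d', '0xde00']
```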
And UTF8 uses 8-bit values by default, which means that the first 128 values are fixed-width single-byte characters (the most significant bit of such a byte is 0, leaving 7 bits for the actual character value; a set high bit marks a byte that belongs to a multi-byte sequence). All other characters are encoded as sequences of 2 to 4 bytes.
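You can see that bit layout by dumping the UTF-8 bytes in binary; this Python 3 sketch is just one way to do it:

```python
# The high bits of each UTF-8 byte reveal its role: '0xxxxxxx' is a
# single-byte ASCII character, '110xxxxx'/'1110xxxx'/'11110xxx' start a
# 2/3/4-byte sequence, and '10xxxxxx' is a continuation byte.
for ch in ("A", "é", "€", "😀"):
    print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])
```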
And that leads us to the advantages. Any ASCII character is directly compatible with UTF8, so for upgrading legacy apps, UTF8 is a common and obvious choice. It may also use the least memory, assuming that your app uses mostly ASCII characters (since these are encoded as a single byte each). On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.
UTF32 is the opposite: it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from its length in bytes. You can't do that with UTF8.
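For example (Python 3, with an arbitrary sample string):

```python
# With UTF-32 the code-point count is simply byte length / 4; the UTF-8
# byte length tells you nothing directly about the number of code points.
s = "naïve 😀"
print(len(s.encode("utf-32-le")) // 4)   # 7 code points
print(len(s.encode("utf-8")))            # 11 bytes, not 7
```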
UTF16 is a compromise. It lets most characters fit into a fixed-width 16-bit value, so as long as you don't have characters outside the Basic Multilingual Plane (emoji, musical notation symbols, some rare ideographs and so on), you can assume that each character is 16 bits wide. It uses less memory than UTF32, and if you have a lot of non-ASCII text, it may also use less memory than UTF8.
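A rough size comparison along those lines, assuming Python 3 and an arbitrary Japanese sample string:

```python
# Most CJK characters sit in the U+0800..U+FFFF range: 3 bytes each in
# UTF-8 but only 2 bytes each in UTF-16.
jp = "こんにちは世界"                 # 7 code points, all in the BMP
print(len(jp.encode("utf-8")))       # 21 bytes
print(len(jp.encode("utf-16-le")))   # 14 bytes
```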
Finally, it's often helpful to just go with what the platform supports. Windows uses UTF16 internally, so on Windows, that is the obvious choice.
Linux Unicode support varies a bit, but they generally use UTF8 for everything that is Unicode-compliant.
Depending on your dev environment, you may not even have a choice of which encoding your string data type uses internally.
But for storing and exchanging data I would always use UTF8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for least I/O is the way to go on modern machines.
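A minimal sketch of being explicit about that when writing and reading a file, assuming Python 3 (the file name and contents are placeholders):

```python
# Write and read back a file, stating the encoding explicitly instead of
# relying on the platform default.
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("café, 10 €\n")

with open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())
```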
I just got done reading Joel's article on Unicode from several years back. I think much of it still applies.
In UTF32 all characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra 3 bytes.
In UTF8 characters have variable length: ASCII characters are coded in 1 byte (8 bits), most western special characters are coded in either 2 or 3 bytes (for example, € is 3 bytes), and more exotic characters can take up to 4 bytes. The clear disadvantage is that you cannot calculate a string's length a priori. But it takes a lot fewer bytes to encode Latin (English) alphabet text compared to UTF32.
UTF16 is also variable length. Characters are coded in either 2 or 4 bytes. I really don't see the point. It has the disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF8.
Of those three, UTF8 is clearly the most widespread.
Can't we all just use UTF-8? - sort of a plea for the future, I guess. What I did not see mentioned is that the web is turning to UTF-8, and for coding, where you work with text files, UTF-8 is the obvious choice.
I'd like to look back in 10 years with UTF16 and 32 as nothing but curiosities. Let all those code pages fade away. Goodbye to the 'Text Encoding' menu on your browser.