views: 1295
answers: 5

+4  Q: 

UTF-8 vs Unicode

I have heard conflicting opinions from people - according to Wikipedia (see here), they are the same thing. Are they? Can someone clarify?

+10  A: 

They're not the same thing - UTF-8 is a particular way of encoding Unicode.

There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.
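For instance, here's a minimal Python sketch (illustrative, not from the original answer) showing how the byte count of the same five-character string differs between those three encodings:

    text = "héllo"                       # five characters (code points)
    print(len(text.encode("utf-8")))     # 6  - 'é' takes two bytes in UTF-8
    print(len(text.encode("utf-16-le"))) # 10 - two bytes per character here
    print(len(text.encode("utf-32-le"))) # 20 - four bytes per character, always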

Greg
This is a nice, short and accurate answer.
thomasrutter
However, the point is that some editors offer to save the file as "Unicode" OR "UTF-8", so I believe it's necessary to mention that "Unicode" in that case means UTF-16.
serhio
+13  A: 

"Unicode" is a unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. Characters in the ASCII range are encoded as a single byte each, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).

When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.

Some references on Unicode:

Jon Skeet
+1: excellent references.
Jon Tackabury
I think UTF-16 only equals "Unicode" on Windows platforms. People tend to use UTF-8 by default on *nix. +1 though, good answer
jalf
If it's relevant, Windows 1252 (ISO 8859-1) is UTF-8 afaik, which most of Europe uses
Chris S
I'll clarify that it's not a unicode standard but a subset of ISO 8859-1, and implemented as 1 byte unicode
Chris S
@Chris: No, ISO-8859-1 is *not* UTF-8. UTF-8 encodes U+0080 to U+00FF as two bytes, not one. Windows 1252 and ISO-8859-1 are *mostly* the same, but they differ between values 0x80 and 0x99 if I remember correctly, where ISO 8859-1 has a "hole" but CP1252 defines characters.
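An illustrative Python check of that difference (not part of the original comment thread):

    print(repr(b"\x80".decode("cp1252")))      # '€' - Windows-1252 assigns 0x80 to the euro sign
    print(repr(b"\x80".decode("iso-8859-1")))  # '\x80' - ISO-8859-1 maps it to a C1 control character
    print("é".encode("iso-8859-1"))            # b'\xe9' - one byte in Latin-1...
    print("é".encode("utf-8"))                 # b'\xc3\xa9' - ...but two bytes in UTF-8, so Latin-1 is not UTF-8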
Jon Skeet
Some of your sources are out of date: UTF-8 uses a maximum of four bytes per character, not six. I believe it was reduced primarily to eliminate the "overlong forms" problem described by Markus Kuhn in his FAQ.
Alan Moore
Alan: I originally had it as 4 (see edits) but then read the wrong bit of the document I was reading. Doh. U-04000000 – U-7FFFFFFF would take 6 bytes, but there are no characters above U-001FFFFF - at least at the moment...
Jon Skeet
Last I heard, the maximum Unicode code point is U+0010FFFF -- so there's even more room to grow. It's going to be a while before we have to graft surrogate pairs onto UTF-32, as the author of the accepted answer seems to think is the case. ;-)
Alan Moore
@Alan: Absolutely :)
Jon Skeet
The idea of calling UTF-16 "Unicode" sits uneasily with me due to its potential to confuse - even though this was clearly pointed out as a .NET convention only. UTF-16 is a way of representing Unicode, but it is not "The Unicode encoding".
thomasrutter
@thomasrutter: It's not just a .NET convention. I've seen it in plenty of places. For example, open Notepad and do "Save As" and one of the encoding options is "Unicode". I know it's confusing and inaccurate, but it's worth being aware that it's in fairly widespread use for that meaning.
Jon Skeet
@Alan M: To quote myself: "The Unicode standard defines fewer code points than can be represented in 32 bits." The point is that the UTF family of encodings allows for surrogate pairs, while other encodings do not.
unwesen
@unwesen: UTF-8 doesn't need surrogate pairs. It just represents non-BMP characters using progressively longer byte sequences.
Jon Skeet
@unwesen: My point was that, unlike UTF-8 and UTF-16, UTF-32 has always been a fixed-width encoding and always will be. Whether it's in the BMP or one of the supplemental planes, every code point is represented by exactly four bytes.
Alan Moore
As for using "Unicode" to mean UTF-16, you're right, Jon: that's a Microsoft convention rather than a .NET convention, and I hate it too. This stuff is difficult enough to explain without MS exposing all its customers to this blatantly incorrect usage.
Alan Moore
+7  A: 

Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.
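For example (an illustrative Python sketch, not from the original answer):

    ch = "€"
    print(hex(ord(ch)))            # 0x20ac - the code point U+20AC, which is what Unicode defines
    print(ch.encode("utf-8"))      # b'\xe2\x82\xac' - one way to store it (three bytes)
    print(ch.encode("utf-16-le"))  # two bytes, 0xAC 0x20 - another way to store the same code point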

Martin Cote
However, the point is that some editors offer to save the file as "Unicode" OR "UTF-8", so I believe it's necessary to mention that "Unicode" in that case means UTF-16.
serhio
+4  A: 

Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" refers to the character set rather than the standard.

Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.

Gumbo
great article, thanks!
gnomixa
However, the point is that some editors offer to save the file as "Unicode" OR "UTF-8", so I believe it's necessary to mention that "Unicode" in that case means UTF-16.
serhio
@serhio: I know. Although there are three different UTF-16 encodings: The two explicit *UTF-16LE* and *UTF-16BE* and the implicit *UTF-16* where the endianness is specified with a BOM.
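A minimal Python sketch of those three flavours (illustrative, not part of the original comment):

    s = "A"
    print(s.encode("utf-16-le"))  # b'A\x00' - little-endian, no BOM
    print(s.encode("utf-16-be"))  # b'\x00A' - big-endian, no BOM
    print(s.encode("utf-16"))     # b'\xff\xfeA\x00' - BOM first, then the data (little-endian on this machine)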
Gumbo
+11  A: 

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, plus numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, e.g. ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1 or latin1; and yes, there are several other parts of the ISO 8859 standard as well).
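As an illustrative Python sketch (not from the original answer), the same byte value can mean different characters depending on which 8-bit mapping you assume:

    print(b"\xa4".decode("iso-8859-1"))   # '¤' - the generic currency sign in Latin-1
    print(b"\xa4".decode("iso-8859-15"))  # '€' - Latin-9 reassigned 0xA4 to the euro sign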

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings of this kind are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, 4 in UTF-32), but characters from other languages can occupy up to four bytes in UTF-8 as currently defined (the original design allowed up to six).
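Here is what those flag bits look like for UTF-8, as an illustrative Python sketch (not part of the original answer):

    for byte in "€".encode("utf-8"):  # U+20AC encodes to three bytes: E2 82 AC
        print(f"{byte:08b}")
    # 11100010   <- lead byte: the 1110 prefix means "three units in this sequence"
    # 10000010   <- continuation byte (10 prefix)
    # 10101100   <- continuation byte (10 prefix)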

Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the bytes to Unicode code points before they can be performed (there are some shortcuts, though).
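For example (an illustrative Python sketch):

    data = "naïve".encode("utf-8")
    print(len(data))                   # 6 - byte count, because 'ï' takes two bytes
    print(len(data.decode("utf-8")))   # 5 - code point count, only known after decoding
    print(data[:3])                    # b'na\xc3' - slicing raw bytes can cut a character in half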

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 become the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.
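For example (an illustrative Python sketch, with an assumed sample character outside the BMP):

    ch = "𝄞"                                 # U+1D11E, above U+FFFF
    print(len(ch.encode("utf-32-le")))       # 4 - UTF-32: always exactly four bytes per code point
    print(len(ch.encode("utf-16-le")) // 2)  # 2 - UTF-16 needs two 16-bit units (a surrogate pair)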

Hope that fills in some details.