Here's my description of Unicode. Please correct and comment.

Unicode separates the representation of a character from the mechanism of storing it. This is different from ANSI, in which each character is represented by a single byte.

An ANSI code page maps characters to byte representations. Unicode maps characters to code points. A code point is an abstract concept. It is the responsibility of the encoding scheme to represent the Unicode code points in bytes.
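For example, here is a minimal Python sketch (Python is used purely for illustration) showing that a code point is one abstract number, while its byte representation depends entirely on the chosen encoding:

    # A sketch: the same code point, U+00E9 (é), is one abstract number,
    # but different encodings turn it into different byte sequences.
    ch = "\u00e9"                     # LATIN SMALL LETTER E WITH ACUTE
    print(hex(ord(ch)))               # 0xe9 -- the code point itself
    print(ch.encode("utf-8"))         # b'\xc3\xa9'          (2 bytes)
    print(ch.encode("utf-16-le"))     # b'\xe9\x00'          (2 bytes, different ones)
    print(ch.encode("utf-32-le"))     # b'\xe9\x00\x00\x00'  (4 bytes)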

There are many Unicode encoding schemes. Some use a fixed number of bytes to represent each code point; this approach must balance the range of code points it can represent against the amount of storage space required. Other encoding schemes use a variable number of bytes per code point; this complicates parsing the data, but it avoids the trade-off between range of representation and storage space that fixed-length encodings suffer from.
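A small sketch of that trade-off, taking UTF-32 as the fixed-length example and UTF-8 as the variable-length one:

    # A sketch: UTF-32 always spends 4 bytes per code point, while UTF-8
    # spends between 1 and 4 bytes depending on the code point.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):   # A, é, €, an emoji
        print(hex(ord(ch)),
              len(ch.encode("utf-32-le")),   # always 4
              len(ch.encode("utf-8")))       # 1, 2, 3, 4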

UTF-8 is the most common Unicode encoding. Its popularity is largely due to its backward compatibility with ASCII, a character set (a subset of the ANSI code pages) containing the English alphabet, numerals, and common punctuation. UTF-8 is a variable-length encoding and is capable of encoding all Unicode code points.
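A quick sketch of that ASCII compatibility:

    # A sketch: pure-ASCII text produces byte-for-byte identical output
    # under ASCII and UTF-8, which is why existing ASCII data is already
    # valid UTF-8.
    text = "Hello, world!"
    print(text.encode("ascii") == text.encode("utf-8"))   # True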

A: 

That sounds pretty accurate. You may want to add that UTF-8 is commonly used to store text documents and to transfer text over the wire because it is compact. UTF-16 is also very common: the Java and .NET String classes use UTF-16 internally because it is efficient to process.
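A rough sketch of the size difference for mostly-ASCII text (Python is used here only to count the encoded bytes):

    # A sketch: mostly-ASCII text is half the size in UTF-8 compared with
    # UTF-16, which stores every ASCII character in 2 bytes.
    text = "The quick brown fox"
    print(len(text.encode("utf-8")))      # 19 bytes
    print(len(text.encode("utf-16-le")))  # 38 bytes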

Justice
A: 

A couple of finer points: ASCII compatibility is not the only (or even the main) reason for the popularity of UTF-8. AFAIK, a very welcome side effect of ASCII compatibility is that an ASCII string converted to UTF-8 keeps exactly the same byte size. In other words, when writing text with few or no non-ASCII characters, you get all of the benefits of ASCII and pay only a few extra bytes for the non-ASCII characters. Also, I believe all official Unicode encodings are able to represent all Unicode code points.
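A quick round-trip sketch of that last point:

    # A sketch: every official Unicode encoding can round-trip any code
    # point, including one outside the Basic Multilingual Plane.
    ch = "\U0001F600"   # U+1F600
    for enc in ("utf-8", "utf-16", "utf-32"):
        assert ch.encode(enc).decode(enc) == ch
    print("all round-trips succeeded")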

l0b0
+4  A: 

This is probably a good place to mention Joel Spolsky's article on what every programmer should know about Unicode and character sets.

Martin Beckett
A: 

I'd get rid of the references to ANSI if I were you. In the context of character sets and encodings, "ANSI" typically refers to the default code page of whatever (Windows) system you're working on. That usually means one of Microsoft's extended or altered versions of an existing standard, like windows-1252 as opposed to ISO-8859-1. Ironically, these extensions have not been blessed by ANSI. This usage of the term "ANSI" was coined by Microsoft and can usually be found in the encoding selection portion of "Save As" dialogs in Microsoft apps like Notepad. There you will usually find an option called "Unicode", which actually means UTF-16 (little-endian, without BOM).
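A small sketch of how those names diverge in practice, using Python's codec names cp1252 and latin-1 for windows-1252 and ISO-8859-1:

    # A sketch: the byte 0x80 means different things under Microsoft's
    # windows-1252 ("ANSI" on many Windows systems) and under ISO-8859-1,
    # and the "Unicode" option is really UTF-16 little-endian.
    print(repr(b"\x80".decode("cp1252")))   # '€' -- a Microsoft extension
    print(repr(b"\x80".decode("latin-1")))  # '\x80' -- a C1 control character
    print("\u20ac".encode("utf-16-le"))     # b'\xac ' -- bytes AC 20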

So if you really want to understand Unicode, you should start by throwing out anything you've learned or inferred by seeing it in Windows software (or third-party software that emulates Windows software). In fact, throw out everything you've picked up about Unicode so far and start over from scratch. It's a complex subject, and as with any complex subject, you'll find much more bad information about it than good.

Alan Moore
A: 

UTF-8 is only popular in the Western hemisphere. Languages whose scripts have always needed multi-byte encodings gain much more from using UTF-16 or even UTF-32.
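For example, a short sketch comparing byte counts for a BMP-only Japanese string:

    # A sketch: BMP CJK characters take 3 bytes each in UTF-8 but only
    # 2 bytes each in UTF-16.
    text = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"   # 日本語のテキスト
    print(len(text))                      # 8 characters
    print(len(text.encode("utf-8")))      # 24 bytes
    print(len(text.encode("utf-16-le")))  # 16 bytes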

Cheers,

Boldewyn