views:

306

answers:

7

Could anyone give me concise definitions of

  • Unicode
  • UTF7
  • UTF8
  • UTF16
  • UTF32
  • Codepages
  • How they differ from ASCII/ANSI/Windows-1252

I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about, and why you should care as a programmer.

+15  A: 

This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Tim
The only caveat is that some of the information is out of date (Unicode being a moving target), though nothing the questioner really needs to care about at this level of interest.
Kathy Van Stone
Actually, Joel's oft-referenced article was not correct even on the date it was published (2003). Correct UTF-8 does not go up to 6 bytes (only 4), there is such a thing as "plain text" (it has nothing to do with the encoding), UCS is not Unicode lingo (it is ISO lingo), and wchar_t and L"Hello" are not necessarily Unicode. But hey, he knows more than most, even if some of it is wrong. The message is still the correct one :-)
Mihai Nita
+4  A: 

Here, read this wonderful explanation from Joel himself.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Developer Art
+6  A: 

As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...

Jon Skeet
+2  A: 

I've picked up some insight that might not be entirely right, but it has helped me understand this.

Let's take some text. It's stored in the computer's RAM as a series of bytes; a codepage is simply the mapping table between those bytes and the characters you and I read. So an application like Notepad comes along with its codepage, translates the bytes onto your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage than others, and some byte streams begin with a BOM (Byte Order Mark), which can declare the correct encoding to use.
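A quick sketch of this in Python (the codec names are Python's own):

```python
# The same bytes mean different things under different codepages.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'

print(data.decode("utf-8"))          # café  — the right codepage
print(data.decode("cp1252"))         # cafÃ©  — wrong codepage: classic mojibake

# A BOM is just a marker at the front of the byte stream that
# identifies the encoding (and, for UTF-16/32, the byte order).
with_bom = "hello".encode("utf-8-sig")
print(with_bom[:3])                  # b'\xef\xbb\xbf' — the UTF-8 BOM
```

The data itself never changed; only the mapping used to interpret it did.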

UTF-7, UTF-8, UTF-16 and so on are all just different encodings: different ways of mapping the same characters to bytes.

The same text stored under different encodings will often have a different file size, because the same characters are stored as different byte sequences.
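You can see the size difference directly in Python (the sample string and codec names are illustrative):

```python
text = "Hello, Θ!"   # mostly ASCII plus one Greek letter — 9 characters

# The same 9 characters take a different number of bytes in each encoding.
for enc in ("utf-8", "utf-16", "utf-32", "cp1253"):
    print(enc, len(text.encode(enc)))
# utf-8   10  (Θ takes two bytes)
# utf-16  20  (2-byte BOM + 2 bytes per character)
# utf-32  40  (4-byte BOM + 4 bytes per character)
# cp1253  9   (Greek codepage: one byte per character)
```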

They also don't really differ in kind from Windows-1252, as that's just another codepage.

For a better, smarter answer, try one of the links.

Robert
+1  A: 

Others have already pointed out good enough references to begin with. Rather than a true dummy's guide, here are some pointers from the Unicode Consortium itself, where you'll find the more nitty-gritty reasons for the different encodings.

The Unicode FAQ is a good enough place to answer some (not all) of your queries.

A more succinct answer to why Unicode exists can be found in the Newcomer's section of the Unicode website itself:

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
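To get a feel for these trade-offs, here's a small Python sketch counting code units per character in each encoding form (the sample characters are illustrative; codec names are Python's):

```python
# One character each requiring 1, 2, 3 and 4 bytes in UTF-8.
for ch in ("A", "é", "€", "𝄞"):
    print(ch,
          len(ch.encode("utf-8")),            # bytes in UTF-8
          len(ch.encode("utf-16-le")) // 2,   # 16-bit code units in UTF-16
          len(ch.encode("utf-32-le")) // 4)   # 32-bit code units in UTF-32
# A 1 1 1
# é 2 1 1
# € 3 1 1
# 𝄞 4 2 1   — outside the 16-bit range, so UTF-16 needs a surrogate pair
```

Note that only UTF-32 gives exactly one code unit per character; UTF-16 needs two units (a surrogate pair) for characters beyond U+FFFF.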

A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you want uniform, fixed-width storage for every character.

By the way, UTF-7 is not part of the Unicode standard; it was designed primarily for use in mail applications.

Vineet Reynolds
Note that if the text in your application is stored with mark-up (HTML, XML or other similar), then often UTF-8 is more efficient even for Asian languages. For example, when dealing with the web, choosing to use UTF-8 uniformly throughout your workflow is totally reasonable.
MtnViewMark
Yes, I agree with that notion for the web. However, for thick clients programmed in C/C++ etc., UTF-16 usually makes sense for an Asian-language market.
Vineet Reynolds
A: 

Another resource that can be useful for grasping the basics is this post from my blog.

Stefano Borini
+4  A: 

If you want a really brief introduction: Unicode in 5 Minutes

Or if you are after one-liners:

  • Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
  • UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
  • UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
  • UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
  • UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
  • Codepages: a system in DOS and Windows whereby characters are assigned to integers, along with an associated encoding; each covers only a subset of languages. Note that these assignments are generally different from the Unicode assignments
  • ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
  • ANSI: a standards body
  • Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
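A few of these one-liners can be checked directly in Python (codec names are Python's; a quick sketch, not a spec):

```python
# Unicode: a character is just a number (a "code point").
print(hex(ord("€")))                  # 0x20ac — EURO SIGN

# ASCII's assignment is a subset of Unicode, and its encoding
# a subset of UTF-8: 'A' is the same single byte in both.
print("A".encode("utf-8"))            # b'A'

# Windows-1252 vs Latin-1: bytes 0x80–0x9F differ. 0x93 is a
# curly quote in cp1252 but a C1 control character in Latin-1.
print(b"\x93".decode("cp1252"))               # " (U+201C)
print(hex(ord(b"\x93".decode("latin-1"))))    # 0x93
```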

Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode

  • Þ (LATIN CAPITAL LETTER THORN)
  • fi (LATIN SMALL LIGATURE FI)
  • ή (GREEK SMALL LETTER ETA WITH TONOS)
  • or 13 other characters, depending on the encoding and character set used.
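You can see this ambiguity directly in Python, using its codecs for three of the character sets above (codec names are Python's):

```python
data = b"\xde"

# The same single byte decodes to three different characters
# under three different character sets.
for codec in ("latin-1", "mac_roman", "iso8859_7"):
    print(codec, "->", data.decode(codec))
# latin-1   -> Þ
# mac_roman -> fi
# iso8859_7 -> ή
```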
MtnViewMark