views: 2297

answers: 5

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .NET and C/C++, and I need this answer for both Unix and Windows)?

+27  A: 

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

Dylan Beattie
Hehe, when I read the title this was exactly the article that came to my mind.
VVS
I hadn't read that before... got my i18n training through other avenues. Thanks for the link
Akrikos
+17  A: 

A character encoding maps a sequence of code values to the symbols of a given character set. Please see this good Wikipedia article on character encoding.

UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  • A UTF-8 stream contains neither the byte FE nor FF. This makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (byte-order mark).

The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
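
To make those bit patterns concrete, here is a minimal C++ sketch (the function name utf8_sequence_length is just an illustrative choice) that reads the length of a UTF-8 sequence from its lead byte:

    #include <cstdint>

    // Return the number of bytes in the UTF-8 sequence starting with `lead`,
    // or 0 if `lead` is not a valid lead byte (e.g. a 10xxxxxx continuation byte).
    int utf8_sequence_length(std::uint8_t lead) {
        if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte (ASCII range)
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
        if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
        return 0;                             // continuation byte or invalid (FE/FF)
    }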

UTF-16 (UCS-2)

Uses 2 or 4 bytes for each symbol.

UTF-32 (UCS-4)

Always uses 4 bytes for each symbol.
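
As a rough illustration of why UTF-16 sometimes needs 4 bytes: code points above U+FFFF (outside the BMP) are split into a surrogate pair of two 16-bit code units. A minimal sketch, assuming C++11's char16_t/char32_t types (the function name is illustrative):

    #include <utility>

    // Encode a code point above U+FFFF as a UTF-16 surrogate pair (high, low).
    std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
        cp -= 0x10000;                                                 // 20 bits remain
        char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
        char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
        return {high, low};
    }
    // For example, U+1F030 becomes the pair 0xD83C 0xDC30.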

char is just a byte of data and is not an actual encoding; it is not analogous to UTF-8, UTF-16, or ASCII. A char* pointer can refer to any type of data in any encoding.

STL:

Neither the STL's std::string nor std::wstring is designed for variable-length character encodings like UTF-8 and UTF-16.
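
For example, std::string will happily store UTF-8 bytes, but size() and operator[] work on bytes (code units), not on characters. A minimal sketch:

    #include <cassert>
    #include <string>

    int main() {
        // "é" (U+00E9) encoded in UTF-8 is the two bytes 0xC3 0xA9.
        std::string s = "\xC3\xA9";
        assert(s.size() == 2);   // two bytes, even though it is one character
        assert(s[0] == '\xC3');  // indexing returns raw bytes, not code points
        return 0;
    }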

How to implement:

Take a look at the iconv library. iconv is a powerful character encoding conversion library used by projects such as libxml (the XML C parser of GNOME).
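
A minimal sketch of what a conversion with iconv might look like on POSIX systems (UTF-8 to UTF-16LE, with error handling kept to the bare minimum; the helper name utf8_to_utf16le is just illustrative):

    #include <iconv.h>
    #include <stdexcept>
    #include <string>

    std::string utf8_to_utf16le(const std::string& in) {
        iconv_t cd = iconv_open("UTF-16LE", "UTF-8");    // to-encoding, from-encoding
        if (cd == (iconv_t)-1)
            throw std::runtime_error("iconv_open failed");

        std::string out(in.size() * 4 + 4, '\0');        // generous worst-case buffer
        char*  inbuf   = const_cast<char*>(in.data());   // glibc's iconv takes char**
        size_t inleft  = in.size();
        char*  outbuf  = &out[0];
        size_t outleft = out.size();

        size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
        iconv_close(cd);
        if (rc == (size_t)-1)
            throw std::runtime_error("iconv conversion failed");

        out.resize(out.size() - outleft);                // keep only the bytes written
        return out;                                      // raw UTF-16LE bytes
    }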

Other great resources on character encoding:

Brian R. Bondy
Brian, this is wrong. UTF-16 uses 2 to 4 bytes. Only UTF-32 has a fixed width of 4 bytes. Most UTF-16 implementations simply don't extend beyond the BMP and thus only support a limited character set.
Konrad Rudolph
Thanks Konrad, I updated my description.
Brian R. Bondy
Personally, I'd consider using a char* to point to UTF16 data to be a bug.
I guess it depends on the context, for example if I was looking at it as a buffer of data I'd see no problem with this.
Brian R. Bondy
@Konrad Rudolph: these UTF-16 implementations that don't extend beyond the BMP are not UTF-16, but UCS-2. MS Windows comes to mind. UTF-16 supports the full Unicode range.
ΤΖΩΤΖΙΟΥ
Originally UTF-8 used up to six bytes per character (beyond BMP), so you may encounter this encoding.
erickson
ΤΖΩΤΖΙΟΥ: MS (and other vendors, as well!) explicitly label them as (incomplete) UTF-16! That this also happens to be UCS-2 is more like a coincidence (although it clearly isn't, because the Unicode and UCS encodings were designed with compatibility in mind). Technically, there's no difference.
Konrad Rudolph
Rudolph: read the wikipedia article: http://en.wikipedia.org/wiki/UCS-2 . UCS-2 is a predecessor to UTF-16 and obsolete. Again, UTF-16 supports the full unicode range through surrogate pairs, while UCS-2 supports only the BMP (U+0000 to U+FFFF). MS "UTF-16" supports only the BMP, ergo it is UCS-2.
ΤΖΩΤΖΙΟΥ
Code point U+1F030, "DOMINO TILE HORIZONTAL BLACK", can be encoded as UTF-16 but not as UCS-2; isn't that a technical difference?
ΤΖΩΤΖΙΟΥ
char is not necessarily a byte of data. For example, in C#, sizeof(char) == 2.
dtroy
Perhaps the fact that languages have a type "char" is just a vestige from a time when character encodings were much simpler. Using a "char", a "wchar", or really any fixed-width type to represent a character is probably not a good idea. Perhaps new languages shouldn't have "char", but instead just uint8_t, or byte. I typically use uint8_t* or void* to point to data that I think of as a "bag of bytes", like a string where I have the encoding stored in some other variable.
Jon Hess
+3  A: 

The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.

Another encoding is UCS-2, which is always 16 bits and thus doesn't support the full Unicode range.

It is also good to know that one code point isn't necessarily equal to one character. For example, a character such as å can be represented either as a single code point or as two code points: one for the a and one for the combining ring.

Comparing two Unicode strings thus requires normalization to get the canonical representation before comparison.
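
A minimal sketch of such a comparison, assuming the ICU library is available (icu::Normalizer2 does the normalization; the helper name canonically_equal is just illustrative):

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    // Compare two strings after normalizing both to NFC (canonical composition).
    bool canonically_equal(const icu::UnicodeString& a, const icu::UnicodeString& b) {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
        if (U_FAILURE(status)) return false;
        return nfc->normalize(a, status) == nfc->normalize(b, status);
    }

    int main() {
        icu::UnicodeString precomposed((UChar)0x00E5);            // "å" as U+00E5
        icu::UnicodeString decomposed;
        decomposed.append((UChar)0x0061).append((UChar)0x030A);   // "a" + combining ring

        // The code unit sequences differ, but the strings are canonically equal.
        return canonically_equal(precomposed, decomposed) ? 0 : 1;
    }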

John Nilsson
+1  A: 

There is also the issue of fonts. There are two ways to handle fonts: either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use some library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.

John Nilsson
+11  A: 

Received wisdom suggests that Spolsky's article misses a couple of important points.

This article is recommended as being more complete: The Unicode® Standard: A Technical Introduction

This article is also a good introduction: Unicode Basics

The latter in particular gives an overview of the character encoding forms and schemes for Unicode.

mmalc