views:

1694

answers:

7

What prerequisites are needed to do strict Unicode programming?

Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t?

And what is the role played by multibyte character sequences in this scenario?

+4  A: 

To do strict Unicode programming:
- Only use string APIs that are Unicode aware (NOT strlen, strcpy, ... but their wide-string counterparts wcslen, wcscpy, ...)
- When dealing with a block of text, use an encoding that can store Unicode characters (UTF-8, UTF-16, UTF-32, ...) without loss.
- Check that your OS default character set is Unicode compatible (e.g. UTF-8)
- Use fonts that are Unicode compatible (e.g. Arial Unicode)

Multibyte character sequences are an encoding scheme that pre-dates the UTF-16 encoding (the one normally used with wchar_t), and it seems to me it is rather Windows-only.

I've never heard of wint_t.

HTH

sebastien
wint_t is a type defined in <wchar.h>, just like wchar_t is. It has the same role with respect to wide characters that int has with respect to 'char'; it can hold any wide character value or WEOF.
Jonathan Leffler
+5  A: 

Note that this is not about "strict unicode programming" per se, but some practical experience.

What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).

Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
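A rough sketch of the kind of truncation-safe copy described above (the function name here is made up, not our actual API):

```c
#include <stddef.h>
#include <string.h>

/* Copy at most dstsize-1 bytes of src into dst, but never split a
   multi-byte UTF-8 sequence at the truncation point. Always
   NUL-terminates dst. */
static void utf8_strncpy(char *dst, const char *src, size_t dstsize)
{
    size_t n;

    if (dstsize == 0)
        return;
    n = strlen(src);
    if (n >= dstsize)
        n = dstsize - 1;
    /* Back up over continuation bytes (10xxxxxx) that would be left
       without their lead byte. */
    while (n > 0 && (src[n] & 0xC0) == 0x80)
        n--;
    memcpy(dst, src, n);
    dst[n] = '\0';
}
```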

When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).

We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).

Hans van Eck
A valid UTF-8 byte sequence would never be cut off (truncated) by strncpy. Valid UTF-8 sequences may not contain any 0x00 bytes (except for the terminating null byte, of course).
Dan Moulding
@Dan Moulding: if you strncpy(), say, a string containing a single chinese character (which may be 3 bytes) into a 2-byte char array, you create an invalid UTF-8 sequence.
Hans van Eck
+1: I like UTF-8 internally as well. Wrappers ftw!
rubenvb
A: 

From what I know, wchar_t is implementation-dependent (as can be seen from this wiki article), and it's not necessarily Unicode.

PolyThinker
+1  A: 

You basically want to deal with strings in memory as wchar_t arrays instead of char. When you do any kind of I/O (like reading/writing files) you can encode/decode using UTF-8 (this is probably the most common encoding) which is simple enough to implement. Just google the RFCs. So in-memory nothing should be multi-byte. One wchar_t represents one character. When you come to serializing however, that's when you need to encode to something like UTF-8 where some characters are represented by multiple bytes.
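Encoding a single code point really is simple enough to implement; a sketch following RFC 3629 (the function name is made up):

```c
#include <stddef.h>

/* Encode one Unicode code point (U+0000..U+10FFFF) into buf, returning
   the number of bytes written, or 0 for invalid input (surrogates or
   out-of-range values). buf must have room for 4 bytes. */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp <= 0x7F) {
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```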

You'll also have to write new versions of strcmp etc. for the wide character strings (or use the wcs* equivalents such as wcscmp where your library provides them), but this isn't a big issue. The biggest problem will be interop with libraries/existing code that only accept char arrays.

And when it comes to sizeof(wchar_t) (you will need 4 bytes if you want to cover all of Unicode), you can't redefine wchar_t itself, but you can always typedef your own wider character type if you need to.

Mike Weller
+9  A: 

The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.

Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.

The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.

Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.

One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.

UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high bit set to zero. Multi-byte characters have the first byte starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation bytes are always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (also 0xF8..0xFF) cannot appear in valid UTF-8 data.

 U+0000 ..   U+007F  1 byte   0xxx xxxx
 U+0080 ..   U+07FF  2 bytes  110x xxxx   10xx xxxx
 U+0800 ..   U+FFFF  3 bytes  1110 xxxx   10xx xxxx   10xx xxxx
U+10000 .. U+10FFFF  4 bytes  1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx
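The table can be read back mechanically; a hedged sketch of a decoder (hypothetical name; it expects NUL-terminated input, and rejects overlong forms and stray continuation bytes):

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s, storing the code point in
   *cp and returning the sequence length (1..4), or 0 on invalid input.
   Safe on NUL-terminated strings: a NUL byte never passes the
   continuation-byte check, so we never read past the terminator. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        if ((s[1] & 0xC0) != 0x80) return 0;
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return (*cp >= 0x80) ? 2 : 0;          /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        *cp = ((unsigned long)(s[0] & 0x0F) << 12)
            | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return (*cp >= 0x800) ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
            || (s[3] & 0xC0) != 0x80) return 0;
        *cp = ((unsigned long)(s[0] & 0x07) << 18)
            | ((unsigned long)(s[1] & 0x3F) << 12)
            | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return (*cp >= 0x10000 && *cp <= 0x10FFFF) ? 4 : 0;
    }
    return 0;                                  /* stray continuation byte */
}
```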

Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.

UTF-16 thus is a single unit (16-bit word) code set for the 'Basic Multilingual Plane', meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32-bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable width encodings, just like UTF-8 must. The codes for the double-unit characters are called surrogates.

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
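The surrogate arithmetic is mechanical: subtract 0x10000 and split the remaining 20 bits. A sketch (the function name is illustrative):

```c
/* Split a code point above the BMP (U+10000..U+10FFFF) into a UTF-16
   surrogate pair. */
static void utf16_surrogates(unsigned long cp, unsigned *hi, unsigned *lo)
{
    cp -= 0x10000;                            /* now a 20-bit value */
    *hi = 0xD800 | (unsigned)(cp >> 10);      /* leading (high) surrogate */
    *lo = 0xDC00 | (unsigned)(cp & 0x3FF);    /* trailing (low) surrogate */
}
```

For example, U+1D11E (musical symbol G clef) splits into the pair D834 DD1E.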

UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.

You can find a lot more information at the ICU and Unicode web sites.

Jonathan Leffler
I think you are selling `wchar_t` and friends a bit short here. These types are essential in order to allow the C library to handle text in *any* encoding (including non-Unicode encodings). Without the wide character types and functions, the C library would require a set of text-handling functions for *every* supported encoding: imagine having koi8len, koi8tok, koi8printf just for KOI-8 encoded text, and utf8len, utf8tok, utf8printf for UTF-8 text. Instead, we are lucky to have just *one* set of these functions (not counting the original ASCII ones): `wcslen`, `wcstok`, and `wprintf`.
Dan Moulding
All a programmer needs to do is use the C library character conversion functions (`mbstowcs` and friends) to convert any supported encoding to `wchar_t`. Once in `wchar_t` format, the programmer can use the single set of wide text handling functions the C library provides. A good C library implementation will support virtually any encoding most programmers will ever need (on one of my systems, I have access to 221 unique encodings).
Dan Moulding
As far as whether they will be wide enough to be useful: the standard requires an implementation must guarantee that `wchar_t` is wide enough to contain any character supported by the implementation. This means (with possibly one notable exception) most implementations will ensure that they are wide enough that a program that uses `wchar_t` will handle any encoding supported by the system (Microsoft's `wchar_t` is only 16-bits wide which means their implementation does not fully support all encodings, most notably the various UTF encodings, but theirs is the exception not the rule).
Dan Moulding
+3  A: 

This FAQ is a wealth of info. Between that page and this article by Joel Spolsky, you'll have a good start.

One conclusion I came to along the way:

  • wchar_t is 16 bits on Windows, but not necessarily 16 bits on other platforms. I think it's a necessary evil on Windows, but probably can be avoided elsewhere. The reason it's important on Windows is that you need it to use files that have non-ASCII characters in the name (along with the W version of functions).

  • Note that Windows APIs that take wchar_t strings expect UTF-16 encoding. Note also that this is different than UCS-2. Take note of surrogate pairs. This test page has enlightening tests.

  • If you're programming on Windows, you can't use fopen() with non-ASCII file names, since the char * path it takes is interpreted in the ANSI code page rather than UTF-8 (fread() and fwrite() themselves just move bytes). Makes portability painful.
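One workaround sketch (the helper name is made up): use _wfopen() with a wide path on Windows, and plain fopen() with UTF-8 bytes elsewhere:

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Open a file whose name may contain non-ASCII characters. On Windows
   the wide-character (UTF-16) path is authoritative; elsewhere a
   UTF-8 byte string works with plain fopen under a UTF-8 locale. */
static FILE *open_unicode(const wchar_t *wpath, const char *u8path,
                          const char *mode)
{
#ifdef _WIN32
    wchar_t wmode[8];
    mbstowcs(wmode, mode, 8);      /* mode strings are plain ASCII */
    return _wfopen(wpath, wmode);
#else
    (void)wpath;
    return fopen(u8path, mode);
#endif
}
```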

-DB

dbyron
A: 
dan04