Suppose we have an arbitrary string, s.

s may contain text from just about anywhere in the world: people from the USA, Japan, Korea, Russia, China, and Greece all write into s from time to time. Fortunately we don't have time travellers writing in Linear A, however.

For the sake of discussion, let's presume we want to do string operations such as:

  • reverse
  • length
  • capitalize
  • lowercase
  • index into

and, just because this is for the sake of discussion, let's presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.

There are three standard Unicode encoding forms: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let's say I'm sorta dumb, and I want one encoding to rule them all (because rolling a dynamically adapting library for three different kinds of string encodings that hides the difference from the API user sounds hard).
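
For concreteness, here is one and the same character -- the euro sign, U+20AC -- in each of the three forms:

```
U+20AC EURO SIGN, encoded three ways:
  UTF-8 : 0xE2 0x82 0xAC   (three 8-bit code units)
  UTF-16: 0x20AC           (one 16-bit code unit)
  UTF-32: 0x000020AC       (one 32-bit code unit)
```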

  • Which encoding is most general?
  • Which encoding is supported by wchar_t?
  • Which encoding is supported by the STL?
  • Are these encodings all (or not at all) null-terminated?

--

The point of this question is to educate myself and others in useful and usable information for Unicode: reading the RFCs is fine, but there's a 'stack' of information related to compilers, languages, and operating systems that the RFCs do not cover, but is vital to know to actually use Unicode in a real app.
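
For concreteness, here is the flavour of routine I mean -- a sketch of `length` for UTF-8 (my own, and it assumes valid input), which has to count code points rather than bytes:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8 input;
// note this counts code points, not user-perceived characters.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) // ASCII or lead byte: starts a code point
            ++count;
    }
    return count;
}
```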

+8  A: 
  1. Which encoding is most general
    Probably UTF-32, though all three formats can store any character. UTF-32 has the property that every character can be encoded in a single codepoint.

  2. Which encoding is supported by wchar_t
    None. That's implementation-defined. On most Windows platforms it's UTF-16; on most Unix platforms it's UTF-32.

  3. Which encoding is supported by the STL
    None really. The STL can store any type of character you want: just use the std::basic_string<T> template with a character type large enough to hold your code points. Most operations (e.g. std::reverse) do not know about any sort of Unicode encoding, though.

  4. Are these encodings all (or not at all) null-terminated?
    No. NUL is a legal value in any of these encodings, and technically NUL (code point 0) is a legal character in plain ASCII too. NUL termination is a C convention -- not an encoding thing. (The sketch after this list demonstrates points 2 through 4.)
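
Points 2 through 4 in code -- a minimal sketch (char32_t and the U"..." literal are C++11; on older compilers a 32-bit integer typedef plays the same role):

```cpp
#include <cstdio>
#include <string>

int main() {
    // Point 2: the width of wchar_t is implementation-defined --
    // typically 2 bytes (UTF-16) on Windows, 4 bytes (UTF-32) on Unix.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

    // Point 3: the STL stores whatever code units you give it.
    std::basic_string<char32_t> greek = U"\u03b1\u03b2\u03b3"; // alpha beta gamma
    std::printf("code units: %zu\n", greek.size());            // 3

    // Point 4: NUL is a legal value inside a std::basic_string;
    // termination is a property of C strings, not of the encoding.
    std::string with_nul("ab\0cd", 5);
    std::printf("stored bytes: %zu\n", with_nul.size());        // 5, not 2
    return 0;
}
```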

Choosing how to do this has a lot to do with your platform. If you're on Windows, use UTF-16 and wchar_t strings, because that's what the Windows API uses to support Unicode. I'm not entirely sure what the best choice is for UNIX platforms, but I do know that most of them use UTF-8.
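
One caveat if you roll your own reverse: even in UTF-32, some user-perceived characters are built from several code points (combining marks), so reversing code points can still garble text -- see the comments below. A sketch of the pitfall, assuming C++11:

```cpp
#include <algorithm>
#include <string>

int main() {
    // "a" + COMBINING DIAERESIS + "o" renders as "äo".
    std::u32string s = U"a\u0308o";
    // Flipping the code points yields "o" + COMBINING DIAERESIS + "a",
    // which renders as "öa" -- the umlaut jumps to the wrong letter.
    std::reverse(s.begin(), s.end());
    return 0;
}
```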

Billy ONeal
Even with UTF-32 you can't store every character as a single codepoint. That encoding simply ensures a 1:1 mapping between code units and code points (for the details on terminology, check out unicode.org).
Nemanja Trifunovic
Err.. actually, it can. Unicode requires 21 bits for the full set of characters. UTF-32 provides 32 bits in a single codepoint. Characters should never need to be split on UTF-32. You're thinking of UTF-16.
Billy ONeal
@Billy. You are talking about code points here, not characters. Some (in fact many) characters need to be described with multiple code points, regardless of the encoding. Take a look at this link, for instance: http://www.unicode.org/faq/char_combmark.html
Nemanja Trifunovic
Oh-- I see what you're saying now. The docs on `encoded character` on unicode.org do not make that inherently obvious though.
Billy ONeal
Good point about combining characters. "a + combining umlaut + o" displays as "äo" and should be reversed as "oä". Reversing the code points as "o + combining umlaut + a" would produce "öa" -- notice the umlaut jumping.
MSalters
+3  A: 

Have a look at the open-source library ICU, especially its Docs & Papers section. It's an extensive library dealing with all sorts of Unicode oddities.

Malte Clasen
The OP explicitly asked for a non-library answer.
Billy ONeal
Malte Clasen
Ah -- I see. +1 then.
Billy ONeal
+1  A: 

Define "real app" :)

Seriously, the decision really depends a lot on the kind of software you are developing. If your target platform is the Win32 API (with or without wrappers such as MFC, WTL, etc.), you will probably want to use std::wstring with the text encoded as UTF-16. That's simply because the Win32 API uses that encoding internally anyway.
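
For example, the wide (W-suffixed) variants of the API take UTF-16 text directly -- a minimal sketch:

```cpp
#include <windows.h>

int main() {
    // MessageBoxW takes UTF-16 strings; L"..." is a wide literal,
    // and wchar_t is 16 bits on Windows.
    MessageBoxW(NULL, L"\u0391\u03b2\u03b3 -- Greek via UTF-16", L"Unicode", MB_OK);
    return 0;
}
```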

On the other hand, if your output is something like XML/HTML and/or needs to be delivered over the internet, UTF-8 is pretty much the standard: it passes cleanly through protocols that assume characters are 8 bits wide.

As for UTF-32, I can't think of a single reason to use it, unless you need 1:1 mapping between code units and code points (that still does not mean 1:1 mapping between code units and characters!).

For more information, be sure to look at Unicode.org. This FAQ may be a good starting point.

Nemanja Trifunovic
One thing I'm not clear on: can any one of the UTF encodings represent all the glyphs used by living written languages today? That is, if I select UTF-8 or UTF-16, would I lock myself out of certain markets?
Paul Nathan
@Paul. UTF-8, UTF-16, and UTF-32 describe exactly the same data (Unicode code points), only differently encoded; strictly technically speaking, you can use any of them to store any text covered by the Unicode standard (all living languages are covered). Having said that, you'll need to take non-technical issues into account: for instance, China mandates the use of GB18030 even though the standard Unicode encoding forms cover Chinese characters as well.
Nemanja Trifunovic
A: 

In response to your final bullet: UTF-8 is guaranteed never to produce a zero byte when encoding any character (except NUL itself, U+0000, of course). As a result, many functions that work with NUL-terminated strings also work, unchanged, with UTF-8-encoded strings.
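
A quick illustration (a sketch; 'é', U+00E9, encodes to the two bytes 0xC3 0xA9, neither of which is zero):

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // "héllo" in UTF-8: 'é' becomes 0xC3 0xA9. No byte in the encoding
    // is 0x00, so strlen finds the terminator as usual -- though it
    // counts 6 bytes here, not 5 characters.
    const char s[] = "h\xC3\xA9llo";
    std::printf("bytes: %zu\n", std::strlen(s)); // prints 6
    return 0;
}
```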

Dave Taflin