Suppose we have an arbitrary string, s.
s can contain text from just about anywhere in the world: people from the USA, Japan, Korea, Russia, China, and Greece all write into s from time to time. Fortunately, we don't have time travellers writing in Linear A.
For the sake of discussion, let's presume we want to do string operations such as:
- reverse
- length
- capitalize
- lowercase
- index into
and, just because this is for the sake of discussion, let's presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.
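To make the operations above concrete, here is a minimal sketch of two of them (length and reverse), assuming s is held as well-formed UTF-8 in a std::string. The names is_continuation, utf8_length, and utf8_reverse are made up for illustration and do not come from any library. The sketch works on code points only, so a base letter followed by a combining accent would be split apart by the reverse.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Number of code points in s, assuming s is well-formed UTF-8.
// Every byte that is not a continuation byte starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Reverse s one code point at a time, assuming well-formed UTF-8.
// Combining sequences are NOT kept together by this sketch.
std::string utf8_reverse(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    std::size_t end = s.size();
    while (end > 0) {
        std::size_t start = end - 1;
        // Walk back to the lead byte of the last remaining code point.
        while (start > 0 &&
               is_continuation(static_cast<unsigned char>(s[start]))) {
            --start;
        }
        out.append(s, start, end - start);
        end = start;
    }
    return out;
}
```

Note that s.size() still reports bytes, not code points, which is exactly the kind of mismatch that prompts the questions below.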
There are three encoding forms for Unicode: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let's say I'm sorta dumb, and I want one encoding to rule them all (because rolling a dynamically adapting library that handles three different string encodings and hides the difference from the API user sounds hard).
- Which encoding is most general?
- Which encoding is supported by wchar_t?
- Which encoding is supported by the STL?
- Are these encodings all (or not at all) null-terminated?
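As a starting point for the wchar_t and null-termination questions, here is a small probe I'd expect to be able to run on a given toolchain. It is only a sketch: zero_bytes and wchar_bits are hypothetical helper names, and the results depend on the platform (for instance, wchar_t is commonly 16 bits on Windows and 32 bits on Linux).

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Count the zero bytes in the raw storage of a string, excluding the
// terminator, for any code unit type (char, char16_t, char32_t, wchar_t).
template <typename CharT>
std::size_t zero_bytes(const std::basic_string<CharT>& s) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(s.data());
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size() * sizeof(CharT); ++i) {
        if (p[i] == 0) ++n;
    }
    return n;
}

// Width of wchar_t in bits on this platform (implementation-defined).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

On a typical platform with 8-bit bytes, the single letter "A" has no zero bytes as a std::string, one as a std::u16string, and three as a std::u32string, so byte-oriented routines that stop at the first zero byte would cut UTF-16 and UTF-32 text short.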
--
The point of this question is to educate myself and others with useful, usable information about Unicode. Reading the RFCs is fine, but there's a 'stack' of information about compilers, languages, and operating systems that the RFCs do not cover and that is vital to know in order to actually use Unicode in a real app.