I've recently tried to get the full picture of what steps to take to create platform-independent C++ applications that support Unicode. A thing that is confusing to me is that most howtos and similar material equate the character encoding (i.e. ANSI or Unicode) and the character datatype (char or wchar_t). As far as I've learned so far, these are different things: there may exist a character sequence encoded in Unicode but represented by std::string, as well as a character sequence encoded in ANSI but represented as std::wstring, right?

So the question that arises for me is whether the C++ standard gives any guarantee about the encoding of string literals starting with L, or does it just say they are of type wchar_t with a compiler-specific character encoding?

If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way? What is the preferred way to do this? A resource system, or proper encoding of the source files plus proper compiler options?

+1  A: 

The standard makes no mention of encoding formats for strings.

Take a look at ICU from IBM (it's free): http://site.icu-project.org/
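For a feel of the API, here is a minimal sketch, assuming a reasonably recent ICU (4.2 or later, for `fromUTF8`/`toUTF8String`):

```cpp
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // ICU stores text as UTF-16 internally, on every platform.
    // Build a UnicodeString from UTF-8 bytes (é is C3 A9 in UTF-8)...
    icu::UnicodeString us = icu::UnicodeString::fromUTF8("caf\xC3\xA9");

    // ...and convert it back to UTF-8 for output.
    std::string out;
    us.toUTF8String(out);
    std::cout << out << std::endl;  // prints "café" on a UTF-8 terminal
    return 0;
}
```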

Martin York
+4  A: 

The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, so it's up to you to properly represent Unicode strings.

Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
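A quick check makes this concrete; the numbers for the wide literal vary by compiler, since sizeof(wchar_t) is implementation-defined (commonly 2 on Windows, 4 on Linux):

```cpp
#include <iostream>

int main() {
    std::cout << sizeof("abc") << '\n';    // 4: three chars plus the null
    std::cout << sizeof(L"abc") << '\n';   // 4 * sizeof(wchar_t)
    std::cout << sizeof(wchar_t) << '\n';  // implementation-defined
    return 0;
}
```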

The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.

As for string literals in source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
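A minimal sketch of the \uXXXX approach, assuming the execution wide-character set can represent the character and the runtime locale can output it:

```cpp
#include <clocale>
#include <cwchar>

int main() {
    // Pick up the user's locale so wide output can be converted correctly.
    std::setlocale(LC_ALL, "");

    // \u00E9 names LATIN SMALL LETTER E WITH ACUTE by its ISO 10646
    // codepoint, so this source file remains pure ASCII.
    const wchar_t *s = L"caf\u00E9";
    std::wprintf(L"%ls\n", s);
    return 0;
}
```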

Charles Salvia
Nice and exhaustive answer. It's probably worth adding info about some specific (but popular) platforms that guarantee Unicode - e.g. all Windows implementations I know of, as well as GNU libc, do that. On the other hand, FreeBSD does not guarantee that, and has some locales where wide strings aren't Unicode. Also, C99 added a preprocessor symbol that an implementation can define if all its string functions treat wide strings as Unicode regardless of locale - `__STDC_ISO10646__` - e.g. GNU libc defines it. Unfortunately, MSVC does not, even though it matches the semantics.
Pavel Minaev
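To illustrate Pavel's point, a small compile-time probe for that symbol (just a sketch; what it reports depends entirely on your C library):

```cpp
#include <cstdio>

int main() {
#ifdef __STDC_ISO10646__
    // Defined as a date (yyyymmL) when wchar_t holds ISO 10646 codepoints
    // in every supported locale - glibc defines it, MSVC does not.
    std::printf("wchar_t is ISO 10646 (%ld)\n", (long)__STDC_ISO10646__);
#else
    std::printf("no guarantee that wchar_t values are ISO 10646 codepoints\n");
#endif
    return 0;
}
```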
As well, `\u` actually does guarantee mapping to Unicode: something like `\u1234` is _always_ the character represented by Unicode codepoint U+1234, as long as the execution wide-character set supports it. However, even when the character is supported, there's no requirement that `L'\u1234' == 0x1234`, as the character may well be remapped. See ISO C++ 2.2 [lex.charset]/2 for details.
Pavel Minaev
In addition to Pavel's comments: Mac OS is also based around UTF-16 (UCS-2).
Martin York
A: 

C++03 does not mention Unicode (the upcoming C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, the encoding (and even the size) of wchar_t is not specified. Consequently, string literal encoding is implementation-specific. However, you can give Unicode codepoints in string literals by using the \x, \u and \U escapes.

Typically, Unicode apps on Windows use wchar_t (with UTF-16 encoding) as the internal character format, because it makes using the Windows APIs easier, as Windows itself uses UTF-16. Unix/Linux Unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the transfer encoding.
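As an illustration of that Windows-side boundary, here is a rough sketch (the helper name is mine) that converts an internal UTF-16 wstring to UTF-8 for transfer, using the Win32 WideCharToMultiByte API; error handling is omitted for brevity:

```cpp
#include <windows.h>
#include <string>

// Hypothetical helper: UTF-16 (as used internally on Windows) -> UTF-8.
std::string toUtf8(const std::wstring &wide) {
    if (wide.empty()) return std::string();

    // First call computes the required buffer size in bytes.
    int bytes = ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                      (int)wide.size(), NULL, 0, NULL, NULL);
    std::string utf8(bytes, '\0');

    // Second call performs the actual conversion.
    ::WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                          &utf8[0], bytes, NULL, NULL);
    return utf8;
}
```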

eidolon
C++03 does mention ISO 10646, though, which is ISO's equivalent of Unicode; since C++ is an ISO standard, it refers to the other ISO standard instead of Unicode. In practice there are no differences (same values for same characters, for instance).
MSalters
Yes, that's why one can give codepoints via the \u and \U escapes. However, ISO 10646 (UCS) is not the same as Unicode: Unicode "imposes additional constraints on implementations" and "supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646".
eidolon