Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text: a `FoobarA()` and a `FoobarW()`. The `*W()` functions take UTF-16-encoded strings; the `*A()` functions take strings in the current codepage. However, Windows doesn't support a UTF-8 codepage, so you can't directly use UTF-8 with the `*A()` functions, nor would you want to depend on users having any particular codepage set. If you want "Unicode" in Windows, use the Unicode-capable (`*W`) functions. There are tutorials out there; Googling "Unicode Windows tutorial" should turn up several.
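As a quick illustration (a minimal sketch, not production code), here is a call to the wide variant of a real Win32 function; `MessageBoxW` takes UTF-16 strings, which is what `L"..."` literals are on Windows:

```cpp
#include <windows.h>

int main()
{
    // MessageBoxW is the UTF-16 (*W) variant; MessageBoxA would instead
    // interpret its string arguments in the current ANSI codepage.
    MessageBoxW(nullptr, L"Unicode text: \u00E9 \u4E2D", L"Example", MB_OK);
    return 0;
}
```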
If you are storing UTF-8 data in a `std::string`, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing so), and then pass it to Windows.
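For instance, a minimal sketch of such a conversion using the Win32 `MultiByteToWideChar` function might look like this (error handling omitted for brevity):

```cpp
#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring via Win32.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call: ask how many UTF-16 code units the result needs.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');
    // Second call: perform the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}
```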
Many of these problems arise from C/C++ being generally encoding-agnostic. A `char` isn't really a character; it's just an integral type. Even using `char` arrays to store UTF-8 data can get you into trouble if you need to access individual code units, because the signedness of `char` is left implementation-defined by the standards. A statement like `str[x] < 0x80` to check for multi-byte characters can quickly introduce a bug: it is always true if `char` is signed. A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps exactly to the C type `uint8_t`, although `unsigned char` works as well. Ideally, then, I'd make a UTF-8 string an array of `uint8_t`s, but due to old APIs, this is rarely done.
Some people have recommended `wchar_t`, claiming it to be "a Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, `wchar_t` is no more Unicode than `char`. The standard describes it as a type

> which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
On Linux, a `wchar_t` represents a UTF-32 code unit, i.e., a code point, and is thus 4 bytes. On Windows, however, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the wording above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, along with the difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against `wchar_t` if you need portability. (§5.2)
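You can see the size difference for yourself with a one-liner (it prints 4 on typical Linux builds and 2 on Windows):

```cpp
#include <cstdio>

int main()
{
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}
```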
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in `std::string`s, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I talk to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "Latin-1" text string is an "apple". A `char` array that doesn't know what encoding it is in is a recipe for disaster.
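One (hypothetical) way to get "something better" is to make the encoding part of the type, so an "orange" can't silently be handed to a function expecting an "apple"; `Utf8String` below is an illustrative sketch, not a standard type:

```cpp
#include <iostream>
#include <string>

// Hypothetical wrapper: the type itself records the encoding.
struct Utf8String
{
    std::string bytes; // invariant: always holds valid UTF-8
};

// The signature now documents which encoding the function expects.
void PrintUtf8(const Utf8String& s)
{
    std::cout << s.bytes << '\n';
}

int main()
{
    Utf8String greeting{"h\xC3\xA9llo"}; // explicit UTF-8 bytes for "héllo"
    PrintUtf8(greeting);
    return 0;
}
```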