What are the more portable and clean
ways to handle unicode character
sequences in C and C++ ?
Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
Read unicode strings
Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.
- UTF-8 and UTF-32 can be reliably detected by validation.
- UTF-16 can be detected by the presence of a BOM.
- If it's not a UTF encoding, it's likely in ISO-8859-1 or windows-1252.
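As a rough sketch of what "detected by validation" and "presence of a BOM" can look like in C, here is one way to do it. The names is_valid_utf8, detect_bom and bom_kind are made up for this example; they are not from any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if buf[0..len) is well-formed UTF-8: no overlong forms,
   no surrogates (U+D800..U+DFFF), nothing above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;      /* number of continuation bytes expected */
        uint32_t cp;   /* code point decoded so far */
        uint32_t min;  /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }  /* plain ASCII byte */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return false;  /* sequence cut off at end of input */

        for (size_t k = 1; k <= n; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* outside Unicode */
        i += n + 1;
    }
    return true;
}

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Looks only at the first bytes. A UTF-32LE BOM (FF FE 00 00) also starts
   with FF FE; this sketch does not try to tell the two apart. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

Pure ASCII input also passes is_valid_utf8, which is harmless because ASCII is a subset of UTF-8. A Latin-1 file with accented letters will almost always fail it, and that is what makes the check useful.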
Convert unicode strings to ASCII to
save some bytes (if the user only
inputs ASCII)
Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space. And if it isn't, you'll lose information when you convert to ASCII. If you care about saving bytes:
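- Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest (1 byte each). For characters U+0080 to U+07FF, UTF-8 and UTF-16 tie at 2 bytes. For characters U+0800 to U+FFFF, UTF-16 is the smallest (2 bytes versus 3 for UTF-8).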
- Use data compression like gzip. There is a SCSU encoding specifically designed for Unicode, but I don't know how good it is.
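To make the size comparison concrete, here is a small sketch that returns the encoded size of a single code point in each form. The function names utf8_size and utf16_size are invented for this example; UTF-32 is not shown because it is always 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8.
   Returns 0 for surrogates and for values above U+10FFFF. */
size_t utf8_size(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Bytes needed to encode one code point in UTF-16.
   Code points above U+FFFF take a surrogate pair (two 16-bit units). */
size_t utf16_size(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)   return 2;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

Summing these over a representative sample of your text is a quick way to see which encoding is smaller for your data, rather than guessing from the language alone.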
Print unicode strings
Writing UTF-8 is no different from writing ASCII.
Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
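One possible way to wrap this up, assuming your strings are UTF-8 internally: convert to UTF-16 only when stdout is a real Windows console, and write the bytes unchanged everywhere else. The helper name print_utf8 is hypothetical; the Windows calls used are GetStdHandle, GetConsoleMode, MultiByteToWideChar and WriteConsoleW.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes a UTF-8 string to stdout.
   Returns 0 on success, -1 on failure. */
int print_utf8(const char *s)
{
    size_t len = strlen(s);
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    /* GetConsoleMode succeeds only for a real console, not a file or pipe. */
    if (len > 0 && h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        /* First call asks how many UTF-16 units are needed. */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, s, (int)len, NULL, 0);
        if (wlen <= 0) return -1;
        wchar_t *w = malloc((size_t)wlen * sizeof *w);
        if (w == NULL) return -1;
        MultiByteToWideChar(CP_UTF8, 0, s, (int)len, w, wlen);
        DWORD written = 0;
        BOOL ok = WriteConsoleW(h, w, (DWORD)wlen, &written, NULL);
        free(w);
        return ok ? 0 : -1;
    }
#endif
    /* Redirected output, or a non-Windows system: bytes go out unchanged. */
    return fwrite(s, 1, len, stdout) == len ? 0 : -1;
}
```

The GetConsoleMode check matters because WriteConsoleW only works on console handles; when output is redirected to a file or pipe, the plain byte write is what you want anyway.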
Should I use the environment too ?
I've read about LC_CTYPE for example,
should I care about it as a developer
?
LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).
But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
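In practice that could look something like the sketch below, which assumes a POSIX system (nl_langinfo and strcasecmp are POSIX, not ISO C). The helper names codeset_is_utf8 and use_environment_locale are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* True if a codeset name denotes UTF-8. Systems spell it differently,
   so both "UTF-8" and "UTF8" are accepted, ignoring case. */
bool codeset_is_utf8(const char *name)
{
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/* Adopts the locale named by the environment (LANG, LC_ALL, LC_CTYPE, ...)
   and reports whether its character encoding is UTF-8. */
bool use_environment_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return false;  /* the environment names a locale that is not installed */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Call use_environment_locale once at startup. After that, strcoll compares strings by the user's collation rules and printf uses the locale's decimal point. If it returns false, you can either convert at the I/O boundary or tell the user to pick a UTF-8 locale.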