views:

206

answers:

4

What are the more portable and clean ways to handle Unicode character sequences in C and C++?

Moreover, how to:

- Read Unicode strings

- Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)

- Print Unicode strings

Should I use the environment too? I've read about LC_CTYPE, for example; should I care about it as a developer?

A: 

You need to read, print, or convert Unicode to ASCII when it fits? Just use UTF-8 and all of this becomes completely transparent for you.

  • Reading and writing work exactly as they do for plain ASCII
  • ASCII is already a subset of UTF-8

For text analysis and manipulation, use a good library such as ICU, Boost.Locale, or even Qt or Glib, all of which provide solid text-handling tools.
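As a sketch of how transparent UTF-8 is to byte-oriented code, here is a hypothetical helper (the name `utf8_length` is mine, not from any library) that counts code points in a UTF-8 `std::string` simply by skipping continuation bytes; storage, copying, and I/O need no special handling at all:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

For ASCII input this returns the same value as `s.size()`, which is exactly the "transparency" being described.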

Artyom
+2  A: 

What are the more portable and clean ways to handle Unicode character sequences in C and C++?

Use a library like ICU. If you can't — that is, abso-freaking-lutely can't — use one, roll your own. Be prepared to have a Hard Time, though. Also, do look up the Unicode.org documentation and sample source code.

Should I use the environment too?

Yes. You will probably need to use the std::setlocale function as well. This lets you set a locale corresponding to the encoding you want; e.g., if you want British English as the language and UTF-8 as the encoding, you'd set LC_CTYPE to en_GB.UTF-8.

C++03 does not give you a way to deal with Unicode. Your best bet is the wchar_t data type (and, by extension, std::wstring). However, note that its size and character encoding differ across operating systems: Windows uses 2 bytes for wchar_t with UTF-16 encoding, whereas GNU/Linux and Mac OS X use 4 bytes and UTF-32.

C++0x is supposed to amend the situation by adding Unicode literals, codecvt facets, C Unicode TR support (read: <uchar.h>), etc., but that is still a long way off for most compilers. (There are a few questions here on SO that ought to help you get started.)

dirkgently
-1 std::wstring != **the** Unicode string; a std::string holding UTF-8 is just as much a Unicode string as a std::wstring!
Artyom
I did not say `std::wstring` is Unicode.
dirkgently
My point is this: `std::wstring` can be useful for UTF16 (on Windows) and UTF32 (on Mac/Linux). The biggest problem with UTF8 is that it is a variable width encoding and hence a `char` or a `wchar_t` *may* not be able to represent a Unicode character across platforms.
dirkgently
@dirkgently UTF-16 is a variable-width encoding as well. Also, even having access to a single code point is generally useless, as it does not necessarily represent a whole character. So for text analysis you need a powerful library like ICU; for basic use, std::string with UTF-8 is just as good as wide strings.
Artyom
W.r.t. UTF-8, yes, either works. And if you note, the very first line of my answer refers to ICU.
dirkgently
+5  A: 

What are the more portable and clean ways to handle Unicode character sequences in C and C++?

Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.

Read unicode strings

Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.

  • UTF-8 and UTF-32 can be reliably detected by validation.
  • UTF-16 can be detected by the presence of a BOM.
  • If it's not a UTF encoding, it's likely in ISO-8859-1 or windows-1252.
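The BOM part of that detection can be sketched in a few lines (the name `detect_bom` is mine; validating BOM-less UTF-8 takes considerably more code and is left out here):

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Look only at the first bytes of the buffer. The presence of a BOM is
// a strong hint about the encoding; its absence proves nothing.
Bom detect_bom(const std::string& b) {
    if (b.size() >= 3 && b.compare(0, 3, "\xEF\xBB\xBF") == 0) return Bom::Utf8;
    if (b.size() >= 2 && b.compare(0, 2, "\xFF\xFE") == 0)     return Bom::Utf16LE;
    if (b.size() >= 2 && b.compare(0, 2, "\xFE\xFF") == 0)     return Bom::Utf16BE;
    return Bom::None;
}
```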

Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII)

Don't. If your data is all ASCII, then UTF-8 takes exactly the same amount of space, and if it isn't, you'll lose information when you convert to ASCII. If you do care about saving bytes:

  • Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest. For characters U+0800 to U+FFFF, UTF-16 is the smallest.
  • Use data compression like gzip. There is an SCSU encoding specifically designed for Unicode, but I don't know how good it is.
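As a concrete instance of the size trade-off above, the same character comes out at different sizes depending on the UTF encoding (the strings here are hand-encoded for illustration):

```cpp
#include <string>

// 'A' (U+0041): 1 byte in UTF-8, 2 bytes in UTF-16.
// U+4E2D (a CJK character): 3 bytes in UTF-8, only 2 bytes in UTF-16.
const std::string    utf8_zhong  = "\xE4\xB8\xAD";  // U+4E2D as raw UTF-8 bytes
const std::u16string utf16_zhong = u"\u4E2D";       // one 16-bit code unit
```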

Print unicode strings

Writing UTF-8 is no different from writing ASCII.

Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.

Should I use the environment too ? I've read about LC_CTYPE for example, should I care about it as a developer ?

LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).

But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.

dan04
excellent overview, in particular since it avoids any programming-language-specific stuff
Jens Gustedt
A: 

There are good answers written here before this one, but none of them mentions one particular thing that I see as a probable problem, since this question also carries the C tag. My C knowledge is outdated, so please correct me if I'm wrong.

Note that zero-terminated strings, the traditional C string functions, and a UTF-16-encoded data stream are likely a tricky combination: in UTF-16, many Western alphanumeric characters are encoded in two bytes of which one is all zeros, so reading the character data as a series of chars no longer works the way it did with single-byte charsets.
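A tiny illustration of the problem (the array is "AB" hand-encoded as UTF-16LE):

```cpp
#include <cstring>

// "AB" in UTF-16LE: each ASCII letter becomes a 16-bit unit whose high
// byte is zero, so byte-oriented code sees a "terminator" after one byte.
const char utf16le_AB[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

// std::strlen(utf16le_AB) yields 1: it stops at the first zero byte,
// even though the UTF-16 string is two characters (four bytes) long.
```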

jasso
You can use 0x0000-terminated strings with UTF-16. ICU (mentioned above) supports this quite extensively. You can't assume UTF-16 fits in an 8-bit char, as you noted.
Steven R. Loomis