I saw that C++0x will add support for UTF-8, UTF-16 and UTF-32 literals. But what about conversions between the three representations?

I plan to use std::wstring everywhere in my code. But I also need to manipulate UTF-8 encoded data when dealing with files and the network. Will C++0x also provide support for these operations?

+11  A: 

In C++0x, char16_t and char32_t, not wchar_t, will be used to store UTF-16 and UTF-32.

From the draft n2798:

22.2.1.4 Class template codecvt

2 The class codecvt is for use when converting from one codeset to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.

3 The specializations required in Table 76 (22.1.1.1.1) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementor.

Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
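As a rough illustration of the codecvt<char32_t, char, mbstate_t> specialization quoted above, here is a minimal sketch that decodes UTF-8 into UTF-32 through the facet's in() member. The function name utf8_to_utf32 is made up for this example, and the sketch assumes a standard library that actually ships this specialization in the classic locale. Note that later standards (C++20) deprecated the char-based specializations, so some compilers may warn.

```cpp
#include <cwchar>      // std::mbstate_t
#include <locale>      // std::codecvt, std::use_facet, std::locale
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard): decode UTF-8 bytes into
// UTF-32 code points using the standard codecvt facet.
std::u32string utf8_to_utf32(const std::string& utf8) {
    typedef std::codecvt<char32_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());

    // Every code point takes at least one byte in UTF-8, so a buffer of
    // utf8.size() elements is always large enough.
    std::u32string out(utf8.size(), U'\0');
    if (utf8.empty()) {
        return out;
    }

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;

    std::codecvt_base::result res = cvt.in(
        state,
        utf8.data(), utf8.data() + utf8.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // Anything other than ok means the input was malformed or truncated.
    if (res != std::codecvt_base::ok) {
        throw std::runtime_error("invalid or truncated UTF-8 input");
    }

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The reverse direction would use the facet's out() member in the same way, with a char output buffer of up to four bytes per code point.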

The thing about wchar_t is that it gives you no guarantees about the encoding used. It is a type that can hold a wide character. Period. If you are going to write software now, you have to live with this compromise. C++0x-compliant compilers are still a long way off. You can always give the VC2010 CTP and g++ compilers a try, for what it is worth. Moreover, wchar_t has different sizes on different platforms, which is another thing to watch out for (2 bytes on VS/Windows, 4 bytes on GCC/Mac, and so on). There are also options like -fshort-wchar for GCC that further complicate the issue.
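To make the size issue concrete, a small sketch like the following can tell you at compile time whether a single wchar_t is wide enough for any Unicode code point on the platform you are building for. The function name wchar_holds_any_code_point is invented for this example. It only checks the range of the type and says nothing about which encoding the platform actually uses.

```cpp
#include <cwchar>  // WCHAR_MAX

// Hypothetical helper: true when one wchar_t can represent every Unicode
// code point (up to U+10FFFF). Typically false where wchar_t is 2 bytes
// and true where it is 4 bytes.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

It could be used in a static_assert if the code relies on one wchar_t per code point.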

The best solution, therefore, is to use an existing library. Chasing Unicode bugs around isn't the best possible use of effort or time. I'd suggest you take a look at:

More on C++0x Unicode string literals here

dirkgently
A: 

Thank you dirkgently. I'm not yet registered, so I can't upvote or respond directly as a comment.

I've learned something about codecvt. I knew about the libraries you suggest, and the following resource may also be useful: http://www.unicode.org/Public/PROGRAMS/CVTUTF/.

The project is a library that should be open source. I would prefer to minimize dependencies on external libraries. I already depend on libgc and boost, though for the latter I only use threads. I would really prefer to stick to the C++ standard, and I'm a bit disappointed that GC support has somehow been dropped.

Apparently VC++ Express 2008, as well as icc, is said to support most of the C++0x standard. Since I currently develop with VC++ and it will still take some time until the library is released, I'd like to try using codecvt and char32_t strings.

Does anyone know how to do this? Should I post another question?

chmike
Another question is probably the best thing.
dalle
@chmike: Lack of lambda support in 08 made me look no further. However, I can take a look at the extent of C++0x compatibility in VS2008 (I have Pro). Isn't an open source project best supported by an open source compiler? Just curious (even if 08 express edn is free). Feel free to ask more!
dirkgently
@dirkgently I'm trying to make the package work with VC08, g++ and later with icc. It forces me to stick with the standard. This effort helped me find some bugs that the compilers didn't detect. Some were detected by g++ and others by VC08.
chmike