views: 1129
answers: 5

Is there a portable wchar_t in C++? On Windows it's 2 bytes; on everything else it's 4 bytes. I would like to use wstring in my application, but this will cause problems if I decide down the line to port it.

+3  A: 

What do you mean by "portable wchar_t"? There is a uint16_t type that is 16 bits wide everywhere, which is often available. But that of course doesn't make up a string yet. A string has to know its encoding to make sense of functions like length(), substring() and so on (so it doesn't cut characters in the middle of a code point when using UTF-8 or UTF-16). There are some Unicode-compatible string classes I know of that you can use. All can be used in commercial programs for free (the Qt one will be usable in commercial programs for free in a couple of months, when Qt 4.5 is released).

- Glib::ustring from the gtkmm project. If you program with gtkmm or use glibmm, that should be the first choice; it uses UTF-8 internally.
- QString from Qt, which is encoded in UTF-16.
- UnicodeString from ICU, another project that provides portable Unicode string classes; it also seems to use UTF-16 internally, like Qt. I haven't used that one, though.
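
For a feel of what such a class adds over std::string, here is a minimal sketch using Glib::ustring (assuming glibmm is available; the literal is just an example):

#include <glibmm/ustring.h>
#include <iostream>

int main() {
    Glib::ustring s("h\xC3\xA9llo");       // the UTF-8 bytes for "héllo"
    std::cout << s.bytes()  << "\n";       // 6 -- code units (bytes)
    std::cout << s.length() << "\n";       // 5 -- characters (code points)
    std::cout << s.substr(0, 2) << "\n";   // "hé" -- indexed by characters, not bytes
    return 0;
}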

Johannes Schaub - litb
Actually, length(), substring() and co. have no clue about the encoding; they just look at the code unit size and work on that.
Mihai Nita
@Mihai, sure, that's the case for std::string's functions. But that's the reason it can't be used for UTF-8 etc.
Johannes Schaub - litb
@Johannes Schaub: but the answer states "A string has to know its encoding to make sense of functions like length(), substring()". So no, it does not have to know. You can work in terms of code units without knowing the encoding; all you need is the size of the code unit.
Mihai Nita
@Mihai, if you use UTF-8, then code units are 8 bits wide, but to calculate the length of a string, knowing that is not sufficient at all. You have to consider continuation bytes and so on. Otherwise you won't get the length of the string, just the code unit count. Of course, for fixed-length encodings like ASCII that won't matter, and knowing the code unit size is all that matters there.
Johannes Schaub - litb
It depends on what you mean by "the length of a string". If you're allocating memory or reporting disk usage, it *is* the number of code units that matters. If you care about the number of *characters*, you do need to know the encoding. And if you care about how many *columns* your string takes up in a text terminal, that's yet another matter.
dan04
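
To illustrate the distinction made in this thread, here is a rough sketch that counts UTF-8 code units versus code points by skipping continuation bytes (it assumes well-formed UTF-8 and says nothing about display columns):

#include <cstddef>
#include <iostream>
#include <string>

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}

int main() {
    std::string s = "na\xC3\xAFve";            // the UTF-8 bytes for "naïve"
    std::cout << s.size() << "\n";             // 6 -- code units (bytes)
    std::cout << utf8_code_points(s) << "\n";  // 5 -- code points
    return 0;
}
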
+4  A: 

If you're dealing with text used only internally to the program, don't worry about it; a wchar_t in class A is the same as in class B.

If you're planning to transfer data between Windows and Linux/MacOSX versions, you've got more than wchar_t to worry about, and you need to come up with means to handle all the details.

You could define a type that is four bytes everywhere and implement your own strings, etc. (since most text handling in C++ is templated), but I don't know how well that would work for your needs.

Something like:

typedef int my_char;
typedef std::basic_string<my_char> my_string;

David Thornley
You would need char_traits for that, and you can't specialize std::char_traits<int> (per namespace std rules).
MSalters
Also, you can simply use wchar_t/wstring internally. Externally, you use UTF-8 to bypass the endianness mess. On I/O, convert between wchar_t and UTF-8 using template functions specialized on sizeof(wchar_t) (a sketch follows after these comments).
MSalters
-1 Using my_char is a bad idea. You can't write such a string to a stream; you can't do anything with it.
Artyom
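
A rough sketch of the conversion MSalters describes above (one direction only, wide string to UTF-8, choosing the decoding by specializing on sizeof(wchar_t); error handling for malformed input is omitted):

#include <stddef.h>
#include <stdint.h>
#include <string>

// Append one Unicode code point to out as UTF-8.
void append_utf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Primary template: wchar_t already holds whole code points (UTF-32 platforms).
template <size_t WcharSize>
struct WideToUtf8 {
    static std::string convert(const std::wstring& in) {
        std::string out;
        for (std::wstring::size_type i = 0; i < in.size(); ++i)
            append_utf8(out, static_cast<uint32_t>(in[i]));
        return out;
    }
};

// Specialization for 2-byte wchar_t (Windows): recombine surrogate pairs first.
template <>
struct WideToUtf8<2> {
    static std::string convert(const std::wstring& in) {
        std::string out;
        for (std::wstring::size_type i = 0; i < in.size(); ++i) {
            uint32_t cp = static_cast<uint16_t>(in[i]);
            if (cp >= 0xD800 && cp < 0xDC00 && i + 1 < in.size()) {
                uint32_t low = static_cast<uint16_t>(in[i + 1]);
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
            append_utf8(out, cp);
        }
        return out;
    }
};

std::string wide_to_utf8(const std::wstring& in) {
    return WideToUtf8<sizeof(wchar_t)>::convert(in);
}
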
A: 

The proposed C++0x standard will have char16_t and char32_t types. Until then, you'll have to fall back on using integers for the non-wchar_t character type.

#include <stdint.h>   // for uint16_t, uint32_t

// __STDC_ISO_10646__ means wchar_t holds ISO 10646 (UTF-32) code points;
// on Windows, wchar_t is 16 bits and holds UTF-16 code units.
#if defined(__STDC_ISO_10646__)
    #define WCHAR_IS_UTF32
#elif defined(_WIN32) || defined(_WIN64)
    #define WCHAR_IS_UTF16
#endif

// 16-bit character type: the compiler's UTF-16 type if it has one,
// otherwise wchar_t on Windows, otherwise a plain 16-bit integer.
#if defined(__STDC_UTF_16__)
    typedef _Char16_t CHAR16;
#elif defined(WCHAR_IS_UTF16)
    typedef wchar_t CHAR16;
#else
    typedef uint16_t CHAR16;
#endif

// Same idea for a 32-bit character type.
#if defined(__STDC_UTF_32__)
    typedef _Char32_t CHAR32;
#elif defined(WCHAR_IS_UTF32)
    typedef wchar_t CHAR32;
#else
    typedef uint32_t CHAR32;
#endif

According to the standard, you'll need to specialize char_traits for the integer types. But on Visual Studio 2005, I've gotten away with std::basic_string<CHAR32> with no special handling.
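
A small usage sketch, assuming the CHAR32 typedef above (the String32 alias is mine); whether this builds without a char_traits specialization depends on your standard library, as noted:

#include <string>

typedef std::basic_string<CHAR32> String32;   // hypothetical alias

int main() {
    String32 s;
    s.push_back(0x48);      // 'H'
    s.push_back(0x1F600);   // a non-BMP code point fits in a single CHAR32 unit
    return s.size() == 2 ? 0 : 1;   // 2 code units, and here also 2 code points
}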

I plan to use a SQLite database.

Then you'll need to use UTF-16, not wchar_t.

The SQLite API also has a UTF-8 version. You may want to use that instead of dealing with the wchar_t differences.
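
For example, with the UTF-8 flavour of the SQLite C API (the table and column names are made up; error checking omitted):

#include <sqlite3.h>
#include <string>

// Insert a UTF-8 string using the UTF-8 API (sqlite3_prepare_v2 / sqlite3_bind_text).
// The UTF-16 counterparts are sqlite3_prepare16_v2, sqlite3_bind_text16 and
// sqlite3_column_text16 -- they expect UTF-16, which matches wchar_t on Windows
// but not on platforms where wchar_t is 4 bytes.
void insert_name(sqlite3* db, const std::string& utf8_name) {
    sqlite3_stmt* stmt = 0;
    sqlite3_prepare_v2(db, "INSERT INTO people(name) VALUES(?)", -1, &stmt, 0);
    sqlite3_bind_text(stmt, 1, utf8_name.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}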

dan04
+1  A: 

My suggestion: use UTF-8 and std::string. Wide strings would not bring you much added value, since you can't treat a wide character as a letter anyway: some characters are created from several Unicode code points (combining marks, for example).

So use UTF-8 everywhere, and use a good library to deal with natural languages, for example Boost.Locale.
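
For instance, a minimal sketch with Boost.Locale (assuming it is built with the ICU backend; the locale name is just an example):

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");      // UTF-8 locale

    std::string s = "gru\xC3\x9F";             // the UTF-8 bytes for "gruß"
    // Unicode-aware case conversion: with ICU, "ß" upper-cases to "SS".
    std::cout << boost::locale::to_upper(s, loc) << "\n";
    return 0;
}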

A bad idea: defining something like typedef uint32_t mychar;. You can't use iostreams with it; for example, you can't create a stringstream based on this character type, as you would not be able to write into it.

For example, this would not work:

std::basic_ostringstream<unsigned> ss;
ss << 10;

It would not produce a usable string.

Artyom