tags:
views: 754
answers: 7

I assumed that std::wstring and std::string both provide more or less the same interface.

So I tried to enable Unicode capabilities for our application:

# ifdef APP_USE_UNICODE
    typedef std::wstring AppStringType;
# else
    typedef std::string  AppStringType;
# endif

However, that gives me a lot of compile errors when `-DAPP_USE_UNICODE` is used.

It turned out that the compiler chokes when a `const char[]` is assigned to a `std::wstring`.

EDIT: improved example by removing the usage of literal "hello".

#include <string>

void myfunc(const char h[]) {
   std::string  s = h; // compiles OK
   std::wstring w = h; // compile error
}

Why does it make such a difference?

Assigning a const char* to std::string is allowed, but assigning it to std::wstring gives compile errors.

Shouldn't std::wstring provide the same interface as std::string? At least for such a basic operation as assignment?

(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)

+9  A: 

You should do:

#include <string>

int main() {
  const wchar_t h[] = L"hello";
  std::wstring w = h;
  return 0;
}

std::string is a typedef of std::basic_string<char>, while std::wstring is a typedef of std::basic_string<wchar_t>. As such, the 'equivalent' C-string of a wstring is an array of wchar_ts.

The 'L' in front of the string literal is to indicate that you are using a wide-char string constant.

int3
A good way to handle this is like the win32 api and write a TEXT macro that either leaves the string as it is or prepends the L using the ## macro token. So you could write TEXT("hello") and the macro would expand to the correct form.
Mike Weller
A: 

You should use:

#include <tchar.h>

with `tstring` instead of `wstring`/`string`, `TCHAR*` instead of `char*`, and `_T("hello")` instead of `"hello"` or `L"hello"`.

This will use the appropriate form of string and char when _UNICODE is defined.

Yossarian
"(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)" There is no `tchar.h` on my Karmic system. I'm pretty sure it's Windows-specific...
Thomas
-1 TCHAR is windows specific... Never use it in portable apps.
Artyom
I'd never use wchar in portable apps. Windows has much better support for it than Linux :]
Yossarian
The problem is sizeof(Windows::wchar_t)=2, sizeof(AllOtherNonWindowsWorld::wchar_t)=4... Also, UTF-8 is generally much more preferred and less error prone.
Artyom
@Artyom: yes, especially because ASCII is a strict subset of UTF-8. It makes the transition quite a bit simpler.
Tom
<tchar.h> is provided on Windows only, but the idea itself is trivial and portable.
MSalters
+6  A: 

The relevant part of the string API is this constructor:

basic_string(const charT*);

For std::string, charT is char. For std::wstring it's wchar_t. So the reason it doesn't compile is that wstring doesn't have a char* constructor. Why doesn't wstring have a char* constructor?

There is no one unique way to convert a string of char to a string of wchar. What's the encoding used with the char string? Is it just 7 bit ASCII? Is it UTF-8? Is it UTF-7? Is it SHIFT-JIS? So I don't think it would entirely make sense for std::wstring to have an automatic conversion from char*, even though you could cover most cases. You can use:

w = std::wstring(h, h + strlen(h));

which will convert each char in turn to wchar_t, stopping at the NUL terminator (strlen is needed here because the array parameter h has decayed to a pointer, so sizeof(h) would be wrong), and in this example that's probably what you want. As int3 says though, if that's what you mean it's most likely better to use a wide string literal in the first place.

Steve Jessop
+1  A: 

Small suggestion... Do not use "Unicode" strings under Linux (a.k.a. wide strings). std::string is perfectly fine and holds Unicode very well (UTF-8).

Most Linux APIs work with char* strings, and the most popular encoding is UTF-8.

So... Just don't bother yourself using wstring.

Artyom
Not true. For example, `string::size()` gives you the number of bytes, not characters, if your string contains UTF-8 characters that aren't ASCII. It is indeed possible to use `std::string` for this, but you need to be very careful!
Thomas
There is one advantage of UTF-32 (which is what wchar_t is on Linux): it's easy to do stuff like reversing strings. To reverse a UTF-8 string, you have to parse it into distinct characters anyway. So if you're doing a lot of stuff that acts on Unicode characters (rather than their constituent UTF-8 bytes), then you want a wide representation.
Steve Jessop
Does std::wstring::size() give the correct number of characters? NO!!! sizeof(wchar_t) may be 2, and thus valid code points in 0x10000 - 0x10FFFF would be represented as surrogate pairs; if you assume size() gives you the correct number for wstring, your code is WRONG. ;)
Artyom
@Steve, reversing UTF-32 string char-by-char would give you wrong results. because Character!=CodePoint. For example in hebrew word "שָׁלוֹם" reversed would give you incorrect diacritic points. Because character "שָׁ" consists of 3 code points "ש" and two vowels...
Artyom
Artyom: oh, yes, I forgot about Windows and Microsoft's half-baked Unicode... On most other systems, wchar_t is the full 32 bits. But even in that case (diacritics, etc.) you still won't get the right answer. I'm not saying that this is necessarily a problem -- but it will be, if you're not aware of it.
Thomas
Yes, fair point. "easier", not "easy". In some languages, once you've canonicalised your unicode there won't be any combining characters left to worry about. Hebrew evidently is not one of those languages.
Steve Jessop
Also, reversing Hebrew might be a bad idea to start with. The user will have no idea whether you've deliberately reversed the string, or if it's just that your bi-di rendering is broken ;-)
Steve Jessop
@Artyom: UTF-32 (UCS-4) is a fixed-size format and does __not__ have surrogate pairs, thus size() will work as expected. UTF-16 has surrogate pairs (though UCS-2 does not, as the code points are just passed through).
Martin York
@Martin, yes, I agree that under Linux wstring is somewhat more useful, but unfortunately sizeof(wchar_t) is not standardized, which makes life very hard when you work with wstring. Believe me, I know (I wrote Boost.Locale).
Artyom
@Martin: also, I don't think Artyom is talking about surrogate pairs. Combining diacritical marks are a different thing. As far as I know, you can add as many of them to a character as you like, and even beyond the BMP there is no code point for the example given (a Hebrew letter with two vowels). Correct me if I'm wrong, though.
Steve Jessop
A: 

In addition to the other answers, you could use a trick from Microsoft's book (specifically, tchar.h), and write something like this:

# ifdef APP_USE_UNICODE
    typedef std::wstring AppStringType;
    #define _T(s) (L##s)
# else
    typedef std::string  AppStringType;
    #define _T(s) (s)
# endif

AppStringType foo = _T("hello world!");

(Note: my macro-fu is weak, and this is untested, but you get the idea.)

Thomas
A: 

Looks like you can do something like this:

#include <sstream>
#include <string>
// ...
std::wstringstream tmp;
tmp << "hello world";  // a narrow literal is widened char by char
std::wstring our_string = tmp.str();

Although for a more complex situation, you may want to break down and use `mbstowcs`.

RyanWilcox
A: 

To convert from a multibyte encoding to a wide character encoding, take a look at the header <locale> and the type std::codecvt. The Dinkumware library has a class Dinkum::wstring_convert that makes performing such multibyte-to-wide conversions easier.

The class template std::codecvt_byname allows one to obtain a codecvt facet for a particular named encoding. Unfortunately, discovering the names of the encodings (or locales) on your system is implementation-specific.

seh