tags:
views: 754
answers: 7

I assumed that std::wstring and std::string both provide more or less the same interface.

So I tried to enable Unicode capabilities for our application:

# ifdef APP_USE_UNICODE
    typedef std::wstring AppStringType;
# else
    typedef std::string  AppStringType;
# endif

However, that gives me a lot of compile errors when `-DAPP_USE_UNICODE` is used.

It turned out that the compiler chokes when a `const char[]` is assigned to a `std::wstring`.

EDIT: improved example by removing the usage of literal "hello".

#include <string>

void myfunc(const char h[]) {
   std::string  s = h; // compiles OK
   std::wstring w = h; // compile error
}

Why does it make such a difference?

Assigning a const char* to std::string is allowed, but assigning it to std::wstring gives compile errors.

Shouldn't std::wstring provide the same interface as std::string? At least for such a basic operation as assignment?

(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)

+9  A: 

You should do:

#include <string>

int main() {
  const wchar_t h[] = L"hello";
  std::wstring w = h;
  return 0;
}

std::string is a typedef of std::basic_string<char>, while std::wstring is a typedef of std::basic_string<wchar_t>. As such, the 'equivalent' C-string of a wstring is an array of wchar_ts.

The 'L' in front of the string literal is to indicate that you are using a wide-char string constant.

int3
A good way to handle this is like the win32 api and write a TEXT macro that either leaves the string as it is or prepends the L using the ## macro token. So you could write TEXT("hello") and the macro would expand to the correct form.
Mike Weller
A: 

You should use:

#include <tchar.h>

with `tstring` instead of `wstring`/`string`, `TCHAR*` instead of `char*`, and `_T("hello")` instead of `"hello"` or `L"hello"`.

This will use the appropriate form of string and char when _UNICODE is defined.

Yossarian
"(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)" There is no `tchar.h` on my Karmic system. I'm pretty sure it's Windows-specific...
Thomas
-1 TCHAR is windows specific... Never use it in portable apps.
Artyom
I'd never use wchar in portable apps. Windows has much better support for it than Linux :]
Yossarian
The problem is sizeof(Windows::wchar_t)=2, sizeof(AllOtherNonWindowsWorld::wchar_t)=4... Also, UTF-8 is generally much more preferred and less error prone.
Artyom
@Artyom: yes, especially because ASCII is a strict subset of UTF-8. It makes the transition quite a bit simpler.
Tom
<tchar.h> is provided on Windows only, but the idea itself is trivial and portable.
MSalters
+6  A: 

The relevant part of the string API is this constructor:

basic_string(const charT*);

For std::string, charT is char. For std::wstring it's wchar_t. So the reason it doesn't compile is that wstring doesn't have a char* constructor. Why doesn't wstring have a char* constructor?

There is no one unique way to convert a string of char to a string of wchar. What's the encoding used with the char string? Is it just 7 bit ASCII? Is it UTF-8? Is it UTF-7? Is it SHIFT-JIS? So I don't think it would entirely make sense for std::wstring to have an automatic conversion from char*, even though you could cover most cases. You can use:

w = std::wstring(h, h + strlen(h));

which will convert each char in turn to wchar_t, stopping at the NUL terminator (strlen is needed here because the array parameter h has decayed to a pointer, so sizeof(h) would be wrong), and in this example that's probably what you want. As int3 says though, if that's what you mean it's most likely better to use a wide string literal in the first place.

Steve Jessop
+1  A: 

Small suggestion... Do not use "Unicode" strings under Linux (a.k.a. wide strings). std::string is perfectly fine and holds Unicode very well (UTF-8).

Most Linux APIs work with char* strings, and the most popular encoding is UTF-8.

So... Just don't bother yourself using wstring.

Artyom
Not true. For example, `string::size()` gives you the number of bytes, not characters, if your string contains UTF-8 characters that aren't ASCII. It is indeed possible to use `std::string` for this, but you need to be very careful!
Thomas
There is one advantage of UTF-32 (which is what wchar_t is on Linux): it's easy to do stuff like reversing strings. To reverse a UTF-8 string, you have to parse it into distinct characters anyway. So if you're doing a lot of stuff that acts on Unicode characters (rather than their constituent UTF-8 bytes), then you want a wide representation.
Steve Jessop
Does std::wstring::size() give the correct number of characters? NO!!! sizeof(wchar_t) may be 2, and thus valid code points in 0x10000 - 0x10FFFF would be represented as surrogate pairs; if you assume size() gives you the correct number for wstring, your code is WRONG. ;)
Artyom
@Steve, reversing UTF-32 string char-by-char would give you wrong results. because Character!=CodePoint. For example in hebrew word "שָׁלוֹם" reversed would give you incorrect diacritic points. Because character "שָׁ" consists of 3 code points "ש" and two vowels...
Artyom
Artyom: oh, yes, I forgot about Windows and Microsoft's half-baked Unicode... On most other systems, wchar_t is the full 32 bits. But even in that case (diacritics, etc.) you still won't get the right answer. I'm not saying that this is necessarily a problem -- but it will be, if you're not aware of it.
Thomas
Yes, fair point. "easier", not "easy". In some languages, once you've canonicalised your unicode there won't be any combining characters left to worry about. Hebrew evidently is not one of those languages.
Steve Jessop
Also, reversing Hebrew might be a bad idea to start with. The user will have no idea whether you've deliberately reversed the string, or if it's just that your bi-di rendering is broken ;-)
Steve Jessop
@Artyom: UTF-32 (UCS-4) is a fixed-size format and does __not__ have surrogate pairs, thus size() will work as expected. UTF-16 has surrogate pairs (though UCS-2 does not, as the code points are just passed through).
Martin York
@Martin, yes, I agree that under Linux wstring is somewhat more useful, but unfortunately sizeof(wchar_t) is not standardized, which makes life very hard when you work with wstring. Believe me, I know (I wrote Boost.Locale).
Artyom
@Martin: also, I don't think Artyom is talking about surrogate pairs. Combining diacritical marks are a different thing. As far as I know, you can add as many of them to a character as you like, and even beyond the BMP there is no code point for the example given (a Hebrew letter with two vowels). Correct me if I'm wrong, though.
Steve Jessop
A: 

In addition to the other answers, you could use a trick from Microsoft's book (specifically, tchar.h), and write something like this:

# ifdef APP_USE_UNICODE
    typedef std::wstring AppStringType;
    #define _T(s) (L##s)
# else
    typedef std::string  AppStringType;
    #define _T(s) (s)
# endif

AppStringType foo = _T("hello world!");

(Note: my macro-fu is weak, and this is untested, but you get the idea.)

Thomas
A: 

Looks like you can do something like this:

#include <sstream>
#include <string>
// ...
std::wstringstream tmp;
tmp << "hello world";  // a narrow literal is widened char by char
std::wstring our_string = tmp.str();

Although for a more complex situation, you may want to break down and use `mbstowcs`.

RyanWilcox
A: 

To convert from a multibyte encoding to a wide character encoding, take a look at the header <locale> and the type std::codecvt. The Dinkumware library has a class Dinkum::wstring_convert that makes performing such multibyte-to-wide conversions easier.

The class template std::codecvt_byname allows one to obtain a codecvt facet for a particular named encoding. Unfortunately, discovering the names of the encodings (or locales) on your system is implementation-specific.

seh