I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a wide character?

Thanks in advance for the help.

+5  A: 
  1. When you want to store 'wide' (Unicode) characters.
  2. Yes: 255 of them (excluding 0).
  3. Yes.
  4. Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
ChrisW
std::string can hold 0 just fine (just be careful if you call the c_str() method)
Mr Fooz
And strictly speaking, a char isn't guaranteed to be 8 bits. :) Your link in #4 is a must-read, but I don't think it answers the question. A wide character has strictly nothing to do with Unicode. It is simply a wider character. (How much wider depends on the OS, but typically 16 or 32 bits.)
jalf
A: 
  1. when you want to use Unicode strings and not just ASCII; helpful for internationalisation
  2. yes, but it doesn't play well with 0
  3. not aware of any that don't
  4. a wide character is the compiler-specific way of handling a fixed-length representation of a Unicode character; for MSVC it is a 2-byte character, for GCC I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
Greg Domjan
2. A std::string can hold a NUL ('\0') character just fine. It can also hold UTF-8 and wide characters.
@Juan: That put me into confusion again. If std::string can hold Unicode characters, what is special about std::wstring?
@Appu: std::string can hold UTF-8 Unicode characters. There are a number of Unicode encodings targeted at different character widths. UTF-8 uses 8-bit units; there are also UTF-16 and UTF-32, at 16 and 32 bits wide respectively.
Greg D
With a std::wstring, each Unicode character can be one wchar_t when using a fixed-length encoding (for example, if you choose the Joel on Software approach that Greg links to). Then the length of the wstring is exactly the number of Unicode characters in the string, but it takes up more space.
I didn't say it could not hold a 0 ('\0'); what I meant by "doesn't play well" is that some methods may not give you an expected result containing all the data of the string. So harsh on the down votes.
Greg Domjan
I didn't mean to offend, but I didn't agree with your answers to 1 and 2. I can see from Joel's argument why you might want to use wchar_t when working on a Windows system. However, a regular char works just as well for i18n.
A: 

1) As mentioned by Greg, wstring is helpful for internationalization, which is when you will be releasing your product in languages other than English

4) Check this out for wide characters: http://en.wikipedia.org/wiki/Wide_character

Raghu
+3  A: 

I frequently use std::string to hold UTF-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use UTF-8 as the native string type as well.

For example, I use utf-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the string.
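
A rough sketch of one way to get a character count back, assuming the std::string holds valid UTF-8: count only the bytes that start a code point, i.e. skip continuation bytes of the form 10xxxxxx. The function name Utf8CodePoints is just for illustration.

#include <cstddef>
#include <string>

// Counts Unicode code points in a std::string assumed to contain valid UTF-8.
std::size_t Utf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) // not a continuation byte, so it starts a code point
            ++count;
    }
    return count;
}

For example, Utf8CodePoints("na\xC3\xAFve") returns 5 ("naïve" has five characters), while the same std::string's size() is 6.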

Juan: Do you mean that std::string can hold all Unicode characters but the length will be reported incorrectly? Is there a reason that it reports an incorrect length?
When using the UTF-8 encoding, a single Unicode character may be made up of multiple bytes. This is why the UTF-8 encoding is smaller when using mostly characters from the standard ASCII set. You need to use special functions (or roll your own) to measure the number of Unicode characters.
(Windows specific) Most functions will expect that a string of single bytes is ASCII and a string of 2-byte units is Unicode (older versions: MBCS). Which means that if you are storing 8-bit Unicode, you will have to convert it to 16-bit Unicode to call a standard Windows function (unless you are only using the ASCII portion).
Greg Domjan
As Greg and Joel (on Software) mention, it is really important to understand how the encoding works with the API you are dealing with. Constantly converting back and forth between 8- and 16-bit encodings on a Windows system may not be optimal.
+9  A: 
  1. When you want to have wide characters stored in your string. "Wide" depends on the implementation: Visual C++ defaults to 16 bits if I remember correctly, while GCC's default depends on the target; it's 32 bits long here. Please note that wchar_t (wide character type) has nothing to do with Unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store Unicode strings fine in a std::string using the UTF-8 encoding too, but it won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle UTF-8.

    If your wchar_t is 32 bits long, then you can use UTF-32 as a Unicode encoding, and you can store and handle Unicode strings using a fixed-length (UTF-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters. (See the sketch after this list.)

  2. Yes, char is always at least 8 bits long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
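
A minimal sketch of that size() point, assuming the narrow string holds UTF-8 and wchar_t is 32 bits wide (as on the system described above); on other setups the numbers will differ.

#include <iostream>
#include <string>

int main()
{
    const std::string  narrow = "f\xC3\xBCr";  // "für" spelled out as UTF-8 bytes
    const std::wstring wide   = L"f\u00FCr";   // "für" as wide characters

    std::cout << narrow.size() << std::endl;   // 4: char elements (bytes), not logical characters
    std::cout << wide.size() << std::endl;     // 3: wchar_t elements; with 32-bit wchar_t (UTF-32)
                                               //    this matches the number of code points
    return 0;
}
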
Johannes Schaub - litb
I'm curious about #2. I thought 7 bits would be technically valid too? Or is it required to be able to store anything past 7-bit ASCII chars?
jalf
Yes, jalf. C89 specifies minimal ranges for the basic types in its documentation of limits.h (for unsigned char, that's 0..255 minimum), and a pure binary representation for integer types. It follows that char, unsigned char and signed char have a minimum width of 8 bits. C++ inherits those rules.
Johannes Schaub - litb
Ah cool, thanks. :)
jalf
"This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters." This is not entirely accurate, even for Unicode. It would be more accurate to say codepoint than "logical character", even in UTF-32 a given character may be composed of multiple codepoints.
Logan Capaldo
A: 

Actually, std::wstring should be the default answer for 1. Use an 8-bit string only if there's a compelling reason not to ever support Unicode. Not supporting Unicode on a whim has been a bad mistake these past years.

Then again, and related to your question 4, the current C++ standard leaves it very ill-defined what exactly a wide-character string is. This drastically reduces its usefulness. There's simply no platform-independent way in standard C++ to handle Unicode strings. Note that you can write Unicode-aware code in C++, but it is very hard. Hopefully, the situation will become better with the next standard, where explicit Unicode support is added.

Konrad Rudolph
I don't think developers should be using std::wstring unless they're actively internationalizing their applications. A poor, half-assed internationalization effort is worse than no effort at all.
Tom
Most developers need wstring even if they're not internationalizing; only a few developers (mostly US/UK) can get by with ASCII.
MSalters
+109  A: 

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually a 1-byte character. wchar_t is supposed to hold a wide character, and then things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

what about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to Unicode.

On Linux?

Let's take a Linux OS: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

#include <cstring>
#include <cwchar>   // for wcslen
#include <iostream>

int main(int argc, char* argv[])
{
   const char text[] = "olé" ;
   const wchar_t wtext[] = L"olé" ;

   std::cout << "sizeof(char)    : " << sizeof(char) << std::endl ;
   std::cout << "text            : " << text << std::endl ;
   std::cout << "sizeof(text)    : " << sizeof(text) << std::endl ;
   std::cout << "strlen(text)    : " << strlen(text) << std::endl ;

   std::cout << "text(binary)    :" ;

   for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(static_cast<unsigned char>(text[i])) ;
   }

   std::cout << std::endl << std::endl ;

   std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
   //std::cout << "wtext           : " << wtext << std::endl ;
   std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl ;
   std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl ;
   std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl ;

   std::cout << "wtext(binary)   :" ;

   for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(static_cast<unsigned short>(wtext[i])) ;
   }

   std::cout << std::endl << std::endl ;


   return 0;
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(binary)    : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(binary)   : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have four characters, not three. So you should be cautious when truncating or otherwise playing with Unicode chars, because some combinations of bytes are forbidden in UTF-8.
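
As a hedged illustration of that caution (reusing the UTF-8 "olé" from the code above): a byte-oriented substr() can cut a multi-byte sequence in half, leaving data that is no longer valid UTF-8.

#include <iostream>
#include <string>

int main()
{
   const std::string text = "ol\xC3\xA9" ;      // "olé" as UTF-8: 4 bytes for 3 characters
   const std::string cut  = text.substr(0, 3) ; // keeps 'o', 'l' and only the first byte of 'é'

   std::cout << cut.size() << std::endl ;       // 3, but the trailing 0xC3 byte is half of 'é',
                                                // so 'cut' is not valid UTF-8 anymore
   return 0 ;
}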

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char on the different charsets/codepages produced all over the world, before the advent of Unicode.

So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, which is Unicode encoded in 2-byte units (or at the very least, the mostly compatible UCS-2, which is almost the same thing, IIRC).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText (a low-level API function to set the label on a Win32 GUI element).
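
As a rough sketch of that conversion at the API boundary (a hedged example, not taken from the answer above): MultiByteToWideChar with CP_UTF8 turns a UTF-8 std::string into the UTF-16 std::wstring that the wide Win32 APIs expect. The helper name Utf8ToUtf16 is just for illustration, and error handling is omitted.

#include <windows.h>
#include <string>

std::wstring Utf8ToUtf16(const std::string & utf8)
{
   if (utf8.empty()) return std::wstring() ;

   // First call: ask how many wchar_t elements the converted text needs.
   const int needed = ::MultiByteToWideChar(CP_UTF8, 0,
                                            utf8.data(), static_cast<int>(utf8.size()),
                                            NULL, 0) ;

   // Second call: perform the conversion into the wstring's buffer.
   std::wstring utf16(needed, L'\0') ;
   ::MultiByteToWideChar(CP_UTF8, 0,
                         utf8.data(), static_cast<int>(utf8.size()),
                         &utf16[0], needed) ;
   return utf16 ;
}

You would then pass utf16.c_str() to a wide API such as SetWindowTextW.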

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less memory than, or the same amount as, a UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most Western languages, a UTF-8 text will use less memory than the same UTF-16 one.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 bytes per character (unless you're dealing with some kind of esoteric language glyphs: Klingon? Elvish?), while UTF-8 will use from 1 to 4 bytes.

See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

1. When should I use std::wstring over std::string?

On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...

(§) : unless you use a toolkit/framework saying otherwise

2. Can std::string hold all the ASCII character set including special characters?

On Linux? Yes.
On Windows? Only the special characters available in the current locale of the Windows user.

Edit (after a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

  1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
  2. a char from 0 to 127 will be held correctly
  3. a char from 128 to 255 will have a meaning that depends on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.

3. Is std::wstring supported by almost all popular C++ compilers?

I guess so.
It works on my g++ 4.3.2, and I have used the Unicode API on Win32 since Visual C++ 6.

4. What is exactly a wide character?

In C/C++, it's a character type, written wchar_t, which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

paercebal
Mostly good, but just a note on your wording in conclusion 2: ASCII is singular; there is no "all the ASCII character set". The codepages only overload characters above the ASCII range, namely between 128 and 255.
Johann Gerell
Hum. I didn't know that Windows did not follow the POSIX spec in this regard. POSIX says that a wchar_t must be able to represent "distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment".
gnud
@Johann Gerell: You're right... I'll clarify it
paercebal
@gnud: Perhaps wchar_t was supposed to be enough to handle all UCS-2 chars (most UTF-16 chars) before the advent of UTF-16... Or perhaps Microsoft had other priorities than POSIX, like giving easy access to Unicode without modifying the codepage-based use of char on Win32.
paercebal
@gnud: Note the definition of wchar_t quoted on Wikipedia: http://en.wikipedia.org/wiki/Wchar_t ... Apparently, wchar_t on Windows follows what was asked by Unicode... ^_^ ...
paercebal
Your response does explain very well the differences between the two alternatives. Remark: UTF-8 can take 1-6 bytes and not 1-4 like you wrote. Also, I would like to see people's opinions on the two alternatives.
Sorin Sbarnea
@Sorin Sbarnea: UTF-8 could take 1-6 bytes, but apparently the standard limits it to 1-4. See http://en.wikipedia.org/wiki/UTF8#Description for more information.
paercebal
Compiling and executing your code on Mac OS X gives the same output as on your Linux machine.
Wolfgang Plaschg
@Wolfgang Plaschg: Thanks for the info. This is not unexpected, as Mac OS X is a Unix, so it seems natural that they went the "char is UTF-8" way for Unicode support... AFAIK, the only reason Windows did not follow the same road was to continue supporting old, pre-Unicode, charset-based apps.
paercebal
A: 

When should you NOT use wide-characters?

When you're writing code before the year 1990.

Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since ceased to be sufficient. Yes, you can use UTF-8, but why bother with the headaches?

@dave: I don't know what headache UTF-8 creates that is greater than that of widechars (UTF-16). In UTF-16, you also have characters spanning multiple units.
Pavel Radzivilovsky
+1  A: 
  1. A few weak reasons. It exists mainly for historical reasons, from when widechars were believed to be the proper way of supporting Unicode. It is now used to interface with APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character could be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka wide characters). A character in UTF-16 takes either one or two pairs (see the sketch below).

For more information, please see my answer to http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
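
A small sketch of the "one or two pairs" point in item 4, assuming a compiler where wchar_t is 16 bits (e.g. MSVC): a character outside the Basic Multilingual Plane needs two UTF-16 code units, so the wstring stores two elements for one character.

#include <iostream>
#include <string>

int main()
{
   // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
   const std::wstring clef = L"\U0001D11E" ;

   // With 16-bit wchar_t this prints 2 (a surrogate pair of UTF-16 units);
   // with 32-bit wchar_t (e.g. GCC on Linux) it prints 1.
   std::cout << clef.size() << std::endl ;
   return 0 ;
}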

Pavel Radzivilovsky