views: 144
answers: 5

I'm currently developing a cross-platform C++ library which I intend to be Unicode-aware. I currently have compile-time support for either std::string or std::wstring via typedefs and macros. The disadvantage of this approach is that it forces you to wrap literals in a macro like T("string") and to make heavy use of templates parameterized on the character type.

What are the arguments for and against supporting std::wstring only?

Would using std::wstring exclusively hinder the GNU/Linux user base, where UTF-8 encoding is preferred?

+2  A: 

What are the arguments for and against supporting std::wstring only?

The argument in favor of using wide characters is that it can do everything narrow characters can and more.

The arguments against it that I know of are:

  • wide characters need more space (which is hardly relevant; the Chinese do not, in principle, have more headaches over memory than Americans do)
  • wide characters give headaches to some Westerners who are used to all their characters fitting into 7 bits (and who are unwilling to pay a bit of attention to not intermingling uses of the character type for actual characters versus other uses)

As for being flexible: I have maintained a library (several kLoC) that could deal with both narrow and wide characters. Most of this was achieved by making the character type a template parameter; I don't remember any macros (other than UNICODE, that is). Not all of it was flexible, though: there was some code that ultimately required either char or wchar_t strings. (There is no point in making internal key strings wide.)
Users could decide whether they wanted only narrow-character support (in which case "string" was fine), only wide-character support (which required them to use L"string"), or support for both (which required something like T("string")).

sbi
Did you have support for both in the same compilation, like Boost with their format and wformat? Or did you require users to compile one or the other version of the library?
Oskar N.
I don't know boost's `format`/`wformat`, but everything we had in that lib that users might need as either system-encoded or Unicode was templatized on the character type.
sbi
+2  A: 

For:

Against:

  • You might have to interface with code that isn't i18n-aware. But like any good library writer, you'll just hide that mess behind an easy-to-use interface, right? Right?
Kristo
Seems like a great article. I will read it later. Does it mention anything about using std::wstring on GNU/Linux platforms?
Oskar N.
@Kristo: it's unfortunate of course that Joel is mainly a Windows guy and as such his perspective is rather... short-sighted... when it comes to cross-platform. A quick search on "linux" and "unix" on the page brought a single mention: in the historical section.
Matthieu M.
+2  A: 

A lot of people would want to use Unicode with UTF-8 (std::string) rather than UCS-2 (std::wstring). UTF-8 is the standard encoding on many Linux distributions and databases, so not supporting it would be a huge disadvantage. On Linux, every call to a function in your library taking a string argument would require the user to convert a (native) UTF-8 string to std::wstring.

On gcc/Linux each character of a std::wstring takes 4 bytes, while on Windows it takes 2. This can lead to strange effects when reading or writing files (and copying them between platforms). I would rather recommend UTF-8/std::string for a cross-platform project.

David Feurle
Good point. Also it seems GCC does not behave well in an environment where std::string and std::wstring are mixed.
Oskar N.
@Oskar N. What kind of issues ? I never had any problem using both with gcc.
ereOn
for example the different sizes of wchar_t with gcc (4 bytes) and visual studio (2 bytes)
David Feurle
@ereOn Try to print to std::cout and std::wcout in the same program. Only wcout is actually printed. (I just tried this and it seems to work, I must be doing something wrong in my library then... Hmm.)
Oskar N.
What about UTF-8/std::string on Microsoft Windows? Windows uses UTF-16 internally. Is the only viable option to support both, or would I be able to be truly cross-platform with only UTF-8/std::string even on Windows?
Oskar N.
Don't forget that in *any* Unicode encoding, 'characters' may actually be made up of several combining code points --- which means that you can no longer assume that a single wchar holds a single printable thing. This means you've got to be very careful what you do with wide strings. Breaking a wide string inside a combining character can do really bizarre stuff. Given therefore that you can't do random-access inside a string *anyway*, I would second the suggestion that you use UTF-8; it's about the same amount of work and it's much easier to interoperate with old-fashioned ASCII strings.
David Given
Exactly. UTF-16, used internally by Windows 2000 and later (not UCS-2), is a multi-byte character set...
David Feurle
Qt, for example (a cross-platform lib), is based on UTF-8 and has only basic support for std::wstring.
David Feurle
I consider it a bad idea to store UTF-8 in `std::string` because I have learned the hard way that this is problematic. If you do this, you cannot tell from looking at a string's type whether it contains system encoded or UTF-8 encoded characters. (Even in a Unicode application, you will still need a lot of ASCII strings.) In a rather big application I've seen a huge amount of bugs coming in because of UTF-8 strings showing up in the GUI. This only changed after a special instance of `std::basic_string<>` was used for UTF-8, so that the compiler flagged straight assignments as errors.
sbi
Nice idea to add a special instance of std::basic_string. How did you do this? All I can see is adding a typedef to char, but then the compiler won't be able to help you. Until now I considered all instances of std::string to contain UTF-8, since ASCII is a subset of UTF-8. Since the native encoding on Linux is (often) UTF-8, strings will be rendered nicely even in the GUI when using UTF-8.
David Feurle
See as well this post: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
David Feurle
@David: (You should properly @address comment responses. I only saw this accidentally.) You could use `std::basic_string<unsigned char>` or `std::basic_string<signed char>`. The problem with using `std::string` for UTF-8 is that `my_str.length()` lies, and `++idx` gets you to the next byte, instead of the next character.
sbi
+1  A: 

Disadvantage:

Since wstring on Windows is truly UCS-2 and not UTF-16, it will kick you in the shins one day. And it will kick hard.

Kugel
+2  A: 

I would say that using std::string or std::wstring is irrelevant.

Neither offers proper Unicode support anyway.

If you need internationalization, then you need proper Unicode support and should start investigating about libraries such as ICU.

After that, it's a matter of which encoding to use, and this depends on the platform you're on: wrap the OS-dependent facilities behind an abstraction layer and convert in the implementation layer where applicable.

Don't worry about the encoding used internally by the Unicode library you use (or build? hmm); it's a matter of performance and should not affect the use of the library itself.

Matthieu M.