I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that.

I already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem.

The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes.

My feeling based on what I have read is that the HTML should be encoded as UTF-8. I know very little about GUI development, but the little bit I have read indicates that the GUI stuff is all based on UTF-16 encoded strings.

I'm trying to understand where this leaves me. Say we decide that all of our persisted data should be UTF-8 encoded XML. Does this mean that in order to display persisted data in a UI component, I should really be performing some sort of explicit UTF-8 to UTF-16 transcoding process?

I suspect my explanation could use clarification, so I'll try to provide that if you have any questions.

+5  A: 

Windows from NT4 onwards is based on Unicode-encoded strings, yes. Early versions were based on UCS-2, which is the predecessor of UTF-16 and thus does not support all of the characters that UTF-16 does. Later versions are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, are based on UTF-8 instead.

UTF-8 is a very good choice for storing data persistently. It is supported in virtually every Unicode environment, and it strikes a good balance between data size and lossless data compatibility.

Yes, you would have to parse the XML, extract the necessary information from it, and decode and transform it into something the UI can use.
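
For the decoding step, here is a minimal sketch using the Win32 MultiByteToWideChar API (the helper name Utf8ToWide is hypothetical, not from the question's code base):

```cpp
#include <string>
#include <windows.h>

// Decode a UTF-8 byte sequence into a UTF-16 std::wstring for use with
// the Windows GUI. Returns an empty string on failure, for brevity.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call with a null output buffer: ask how many UTF-16
    // code units the converted text will need.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(len, L'\0');
    // Second call: perform the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}
```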

Remy Lebeau - TeamB
It's not really accurate to say that *nix is based on UTF-8 the way Windows is based on UTF-16. It's based on a locale-defined character encoding (in Windows terminology, ANSI). POSIX requires that certain characters (including NUL) be represented in a single byte, so UTF-16 and UTF-32 aren't permitted, but UTF-8 is.
dan04
+2  A: 

One advantage to using std::wstring on Windows for GUI-related strings is that internally all Windows API calls use and operate on UTF-16. If you've ever noticed, there are two versions of every Win32 API call that takes string arguments. For example, "MessageBoxA" and "MessageBoxW". Both definitions exist in <windows.h>, and in fact you can call whichever you want, but if <windows.h> is included with Unicode support enabled, then the following will happen:

#define MessageBox MessageBoxW

Then you get into TCHARs and other Microsoft tricks that try to make it easier to deal with APIs that have both an ANSI and a Unicode version. In short, you can call either, but under the hood the Windows kernel is Unicode-based, so you'll pay the cost of converting to Unicode on every string-accepting Win32 API call if you don't use the wide-char version.

UTF-16 and Windows kernel use
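
To make that concrete, here is a minimal sketch (a hypothetical example, not from the original answer) calling both variants explicitly rather than relying on the MessageBox macro:

```cpp
#include <windows.h>

int main()
{
    // The W version takes UTF-16 (wide) strings directly; no conversion occurs.
    MessageBoxW(NULL, L"Hello from the wide API", L"MessageBoxW", MB_OK);

    // The A version takes ANSI strings; Windows converts them to UTF-16
    // internally before the call reaches the Unicode-based internals.
    MessageBoxA(NULL, "Hello from the ANSI API", "MessageBoxA", MB_OK);
    return 0;
}
```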

cpalmer
+1  A: 

std::wstring is technically UCS-2: two bytes are used for each character and the code tables mostly map to Unicode format. It's important to understand that UCS-2 is not the same as UTF-16! UTF-16 allows "surrogate pairs" in order to represent characters which are outside of the two-byte range, but UCS-2 uses exactly two bytes for each character, period.

The best rule for your situation is to do your transcoding when you read from and write to the disk. Once it's in memory, keep it in UCS-2 format. Windows APIs will read it as if it were UTF-16: that is, while std::wstring doesn't understand the concept of surrogate pairs, if you manually create them (which you won't, if your only language is English), Windows will read them.

Whenever you're reading data in or out of serialization formats (such as XML) in the modern day, you'll probably need to do transcoding. It's an unpleasant and very unfortunate fact of life, but inevitable, since the common Unicode encodings are variable-width and most character-based operations in C++ are done on arrays, which need a consistent element width.

Higher-level frameworks, such as .NET, obscure most of the details, but behind the scenes, they're handling the transcoding in the same fashion: changing variable-width data to fixed-width strings, manipulating them, and then changing them back into variable-width encodings when required for output.
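
As a sketch of the output direction under those assumptions, here is the reverse conversion using the Win32 WideCharToMultiByte API (the helper name WideToUtf8 is illustrative only):

```cpp
#include <string>
#include <windows.h>

// Encode an in-memory UTF-16 string as UTF-8 before serializing it
// (e.g. when writing XML back to disk).
std::string WideToUtf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    // First call: measure the required number of UTF-8 bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  NULL, 0, NULL, NULL);
    if (len <= 0) return std::string();
    std::string utf8(len, '\0');
    // Second call: perform the conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &utf8[0], len, NULL, NULL);
    return utf8;
}
```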

Dan Story
Who says that std::wstring is UCS-2? std::wstring just uses wchar_t instead of char as the base for the string, and wchar_t is implementation-dependent. But I guess on most modern 32/64-bit systems it will be the same as char16_t, in which case either UCS-2 or UTF-16 would fit, since they are 16 bits wide.
jpyllman
Good point. std::wstring isn't technically a character encoding of any kind. It's just two-byte wide characters. But UTF-16 is **not** 16 bits wide! It uses a **minimum** of 16 bits to store a character, but can use up to 32 bits if the character requires! This has led to a number of buffer-overrun attacks against applications which measure UTF-16 encoded strings in characters and then mistakenly allocate (characters+1)*2 bytes of storage and blindly copy the string!
Dan Story
@Dan Story: And it can be even worse than that if there are combining characters to deal with in a single grapheme.
Billy ONeal
OK, wrong wording: every token in UTF-16 is 16-bit, but the resulting character could be 32-bit. And if I'm not wrong, in UTF-8 every token is 8-bit but could hold up to a 31-bit character?
jpyllman
I'm not sure whether char16_t is really defined in the current C++ standard. But in the coming C++0x they say wchar_t should hold the 'largest extended character set specified among the supported locales'. Whether that means 16 or 32 bits on Windows, I'm not sure.
jpyllman
@Dan: `std::wstring` doesn't necessarily use two-byte wide characters either. It uses `wchar_t`. On some Linux systems, this will be a 4-byte character (which might be a UTF-32 character). On Windows, it will be a UTF-16 code unit.
jamesdlin
I think after all of this "clarification" the original poster is probably eyeing suicide as a preferable alternative to character encoding. :)
Dan Story
+1  A: 

Even if you say you only have English in your data, you're probably wrong. Since we're in a global world now, names/addresses/etc. contain foreign characters. OK, I don't know what type of data you have, but generally I would say build your application to support Unicode for both storing data and displaying data to the user. That suggests using XML with UTF-8 for storage and the Unicode versions of the Windows calls for the GUI. And since the Windows GUI uses UTF-16, where each token is 16 bits, I would suggest storing the data in the application in a 16-bit-wide string. And I would guess your compiler for Windows makes std::wstring 16-bit for just this purpose.

So then you will have to do a lot of conversion between UTF-16 and UTF-8. Do that with an existing library, for instance ICU.
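
For example, here is a minimal sketch using ICU's icu::UnicodeString (assuming ICU is available and linked, e.g. against icuuc; the helper names are hypothetical):

```cpp
#include <string>
#include <unicode/unistr.h>  // ICU common library

// Round-trip between UTF-8 (storage) and UTF-16 (GUI) with ICU.
// Assumes Windows, where wchar_t is 16 bits wide.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    // UnicodeString stores UTF-16 code units internally.
    return std::wstring(us.getBuffer(), us.getBuffer() + us.length());
}

std::string Utf16ToUtf8(const std::wstring& wide)
{
    icu::UnicodeString us(reinterpret_cast<const UChar*>(wide.data()),
                          static_cast<int32_t>(wide.size()));
    std::string utf8;
    us.toUTF8String(utf8);
    return utf8;
}
```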

jpyllman
Of course, there is nothing wrong with storing data in XML as UTF-16. But I would suggest UTF-8 anyway, for easier portability between different systems.
jpyllman
UTF-8 is almost always a better choice for XML anyway, because the predominance of English as a computing language means that most markup characters in most XML documents fall inside UTF-8's single-byte range, resulting in significant space savings. This generally holds even if the language of the document *content* is a non-English one that uses an extended character set.
Dan Story
And actually it might even be better to use std::string in the program and store UTF-8 in it, converting to UTF-16 only when Windows needs to display something, and working with UTF-8 in every other respect.
jpyllman
+3  A: 

AFAIK, if you work with std::wstring on Windows in C++ and persist your files as UTF-8 (which sounds good and reasonable), you have to convert the data to UTF-8 when writing to a file and convert it back to UTF-16 when reading from one. Check out this link: Writing UTF-8 Files in C++.

I would stick with the Visual Studio default of Project -> Properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set, use the wchar_t type (i.e. with std::wstring), and not use the TCHAR type. (E.g. I would just use wcslen, the wide version of strlen, and not _tcslen.)
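
For example, a small sketch of that direct wide-character style (a hypothetical snippet, not from the project in question):

```cpp
#include <cwchar>
#include <iostream>
#include <string>

int main()
{
    // Plain wide-character types and functions, no TCHAR indirection:
    std::wstring title = L"Report";
    std::wcout << L"length = " << std::wcslen(title.c_str()) << L"\n";
    return 0;
}
```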

Jim Flood