Most texts on the C++ standard library mention wstring as being the equivalent of string, except parameterized on wchar_t instead of char, and then proceed to demonstrate string only.

Well, sometimes there are specific quirks, and here is one: I can't seem to assign a wstring from a null-terminated array of 16-bit characters. The assignment happily treats the null character and whatever garbage follows as actual characters. Here is a very small reduction:

typedef unsigned short PA_Unichar;
PA_Unichar arr[256];
fill(arr); // sets to 52 00 4b 00 44 00 61 00 74 00 61 00 00 00 7a 00 7a 00 7a 00
// now arr contains "RKData\0zzz" in its 10 first values
wstring ws;
ws.assign((const wchar_t *)arr);
int l = ws.length();

At this point l is not the expected 6 (the number of chars in "RKData"), but much larger. In my test run, it is 29. Why 29? No idea. A memory dump doesn't show any specific value for the 29th character.

So the question: is this a bug in my standard C++ library (Mac OS X Snow Leopard), or a bug in my code? How am I supposed to assign a null-terminated array of 16-bit chars to a wstring?

Thanks

+9  A: 

On most Unixes (Mac OS X included), wchar_t is 32 bits wide and holds a single UTF-32 code point, not a 16-bit UTF-16 code unit as on Windows.

So you need to:

  1. Either:

    ws.assign(arr, arr + length_of_string);
    

    That uses arr as an iterator range and copies each 16-bit value into a wchar_t. But it works only if all your characters lie in the BMP, i.e. the data is effectively UCS-2 (the legacy 16-bit encoding).

  2. Or handle UTF-16 correctly by converting it to UTF-32: you need to find surrogate pairs and merge each pair into a single code point.
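
A minimal sketch of the second option, assuming wchar_t is 32 bits wide (as on Mac OS X and Linux) and the input is null-terminated UTF-16. The helper name utf16_to_wstring is hypothetical, and replacing unpaired surrogates with U+FFFD is one possible policy, not the only one:

```cpp
#include <cstddef>
#include <string>

typedef unsigned short PA_Unichar; // 16-bit code unit, as in the question

// Hypothetical helper: decode a null-terminated UTF-16 buffer into a wstring.
// Assumption: wchar_t can hold any code point (32 bits wide).
std::wstring utf16_to_wstring(const PA_Unichar* src)
{
    std::wstring out;
    for (std::size_t i = 0; src[i] != 0; ++i) {
        unsigned long cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            // Reading src[i + 1] is safe because src[i] was not the terminator.
            unsigned long lo = src[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Merge the pair into a single code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // the low surrogate has been consumed
            } else {
                cp = 0xFFFD; // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired low surrogate
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

With the array from the question, ws = utf16_to_wstring(arr) stops at the first zero unit, so the garbage after it is never read.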

Artyom
A: 

I'd think your code would work, just by inspection. But you could always work around the trouble:

ws.assign(static_cast<const wchar_t*>(arr), wcslen(arr));
Managu
If ws.assign can't find the proper terminating point of the string by picking out the null character, why would wcslen? I think Artyom hit the nail on the head -- wchar_t != unsigned short.
Nick Meyer
+3  A: 

Just do it. You didn't in your code: you assigned an array of unsigned shorts to a wstring and used a cast to shut the compiler up. wchar_t != unsigned short. You certainly can't assume they have the same size.

Logan Capaldo
A: 

Actually, the bug is that I assumed wchar_t was 16 bits. It's not, it's 32 bits. So the cast is clearly wrong.
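
To illustrate why the cast overshoots, here is a rough sketch that counts what a 32-bit reader would see in a buffer of 16-bit units. The helper length_as_32bit is hypothetical and only models the effect, assuming a 32-bit wchar_t; it is not what the library literally does:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: view `count` 16-bit units as 32-bit units and count
// how many come before the first 32-bit unit that is entirely zero.
std::size_t length_as_32bit(const std::uint16_t* units, std::size_t count)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i + 1 < count; i += 2) {
        std::uint32_t wide;
        std::memcpy(&wide, units + i, sizeof wide); // two 16-bit units fused into one
        if (wide == 0) {
            break; // only an aligned pair of zero units looks like a terminator
        }
        ++n;
    }
    return n;
}
```

A single 16-bit zero is paired with a non-zero neighbour, so it does not look like a terminator and the scan continues into whatever follows in memory.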

So the answer will be to convert from whatever 16 bit encoding my data source uses to UTF-32.

Artyom's 2nd suggestion was about right.

As a crude solution, I'll have to append the source characters one by one iterating until I get a zero value.
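
A minimal sketch of that crude approach, valid only when the text stays within the BMP (no surrogate pairs). The helper names unichar_length and widen_bmp are hypothetical:

```cpp
#include <cstddef>
#include <string>

typedef unsigned short PA_Unichar; // 16-bit code unit, as in the question

// Hypothetical helper: number of 16-bit units before the terminating zero.
std::size_t unichar_length(const PA_Unichar* s)
{
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// Hypothetical helper: widen each 16-bit unit to one wchar_t.
// Assumption: the input contains no surrogate pairs (BMP-only text).
std::wstring widen_bmp(const PA_Unichar* s)
{
    // The iterator-range constructor converts element by element,
    // so no pointer cast is needed.
    return std::wstring(s, s + unichar_length(s));
}
```

For the array in the question, widen_bmp(arr).length() is 6.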

Jean-Denis Muys
If this is the case, why don't you accept Artyom's answer and put this into a comment to it?
sbi
Probably because he/she's new here. With a reputation of 37, he/she cannot accept an answer.
Adrian McCarthy
You can't accept an answer with a low reputation? Why do they allow you to ask questions then? The FAQ says "Reputation is completely optional. Normal use of Stack Overflow — that is, asking and answering questions — does not require any reputation whatsoever. " Likewise here http://meta.stackoverflow.com/questions/5234/accepting-answers-what-is-it-all-about makes no mention of requiring reputation to accept an answer.
Logan Capaldo
I apologize, I had not realized I had to accept an answer. Indeed I'm new here.
Jean-Denis Muys