Hello, I have a VARIANT holding a BSTR that was pulled from the MSXML DOM, so it is in UTF-16. I'm trying to figure out which encoding this conversion produces by default:

VARIANT vtNodeValue;
pNode->get_nodeValue(&vtNodeValue);
string strValue = (char*)_bstr_t(vtNodeValue);

From testing, I believe that the default encoding is either Windows-1252 or ASCII, but I am not sure.

Btw, this is the chunk of code that I am fixing: I am changing it to convert the variant to a wstring and then go to a multi-byte encoding with a call to WideCharToMultiByte.

Thanks!

A: 

std::string by itself doesn't specify or contain any encoding. It is merely a sequence of bytes. The same holds for std::wstring, which is merely a sequence of wchar_t values (16-bit code units on Win32).
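As a small illustration of the "sequence of bytes" point, here is a sketch that assumes the bytes happen to be valid UTF-8. The helper name CountUtf8CodePoints is made up for this example. std::string::size() reports bytes, so the character count has to be derived from whatever encoding the caller knows the bytes are in.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8; the
// std::string itself carries no information about its encoding.
std::size_t CountUtf8CodePoints(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the UTF-8 bytes of "café", written as "caf\xC3\xA9", size() is 5 while the helper reports 4 code points.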

By converting _bstr_t to a char* through its operator char*, you'll simply get a pointer to the raw data. According to MSDN, this data consists of wide characters, that is, wchar_ts, which represent UTF-16.

I'm surprised that it actually works to construct a std::string from this; you should not get past the first zero byte (which occurs soon, if your original string is English).

But since wstring is a string of wchar_t, you should be able to construct one directly from the _bstr_t, as follows:

_bstr_t tmp(vtNodeValue);
wstring strValue((wchar_t*)tmp, tmp.length());

(I'm not sure about length; is it the number of bytes or the number of characters?) Then, you'll have a wstring that's encoded in UTF-16 on which you can call WideCharToMultiByte.

Thomas
That's not right, it's not really a cast, `bstr_t` has an `operator char*` defined which does conversion internally.
Tim Sylvester
I know. Is the word "cast" inappropriate? Maybe "conversion operator" is better. I'll change it.
Thomas
That is incorrect: casting a `_bstr_t` to `char*` calls the `_com_util::ConvertBSTRToString` function to convert the string to a byte-based encoding.
interjay
I guess you can call it a cast, but you're definitely not just getting a pointer to the wide-char data.
Tim Sylvester
From http://msdn.microsoft.com/en-us/library/btdzb8eb(VS.71).aspx : "These operators can be used to extract raw pointers to the encapsulated Unicode or multibyte BSTR object. The operators return the pointer to the actual internal buffer, so the resulting string cannot be modified." No mention of any conversion. Is MSDN wrong?
Thomas
@Thomas I suspect the intent of that statement was to indicate that you do not need to deallocate the result. What that statement doesn't say but only implies is that there are actually *two* internal buffers. (Actually rather confusing considering they say "the" pointer to "the" internal buffer.) You get a different pointer value, not just a differently-typed pointer to the same address, depending on what operator you use. The fact that there are both wide and narrow buffers further implies that `bstr_t` must be doing internal encoding conversions.
Tim Sylvester
+4  A: 

The operator char* method calls _com_util::ConvertBSTRToString(). The documentation is pretty unhelpful, but I assume it uses the current locale settings to do the conversion.

Update:

Internally, _com_util::ConvertBSTRToString() calls WideCharToMultiByte, passing zero for the code-page and default-character parameters. A code page of zero is CP_ACP, which means the system's current ANSI code-page setting is used (not the current thread setting).

If you want to avoid losing data, you should probably call WideCharToMultiByte directly and use CP_UTF8. You can still treat the result as a null-terminated single-byte string and store it in a std::string; you just can't treat individual bytes as characters.

Tim Sylvester
Thanks!!! The default code page on US Windows is 1252, which is consistent with what I have observed. This can be determined on any machine with this call: int nCodePage=GetACP();