I'm in the process of trying to learn Unicode. For me the most difficult part is the encoding. Can BSTRs (Basic Strings) contain code points U+10000 or higher? If not, then what's the encoding for BSTRs?

+2  A: 

In Microsoft-speak, Unicode is generally synonymous with UTF-16 (little-endian, if memory serves). In the case of BSTR, the answer seems to be: it depends. According to the documentation, a BSTR:

  • On Microsoft Windows, consists of a string of Unicode characters (wide or double-byte characters).
  • On Apple Power Macintosh, consists of a single-byte string.
  • May contain multiple embedded null characters.

So, on Windows: yes, it can contain characters outside the Basic Multilingual Plane, but these require two 'wide' chars (a surrogate pair) to store.
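
For example, here's a minimal sketch (error handling omitted; link with oleaut32.lib) showing that a single supplementary-plane character, U+1D11E MUSICAL SYMBOL G CLEF, occupies two 16-bit code units in a BSTR:

#include <windows.h>
#include <oleauto.h>
#include <cstdio>

int main() {
    // U+1D11E lies outside the BMP; UTF-16 encodes it as the
    // surrogate pair D834 DD1E.
    BSTR s = SysAllocString(L"\xD834\xDD1E");
    printf("%u code units, %u bytes\n",
           SysStringLen(s),        // 2 -- one code point, two wide chars
           SysStringByteLen(s));   // 4
    SysFreeString(s);
    return 0;
}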

McDowell
I disassembled the system functions SysStringByteLen and SysStringLen. Both read the byte-length prefix, but SysStringLen divides it by 2. Doesn't this mean the system is using the UCS-2 encoding?
Mike
@Mike: I think the documentation for `SysStringLen` is misleading: it returns the number of 16-bit **code units** in the string, not characters. Characters with code points U+10000 and higher use two 16-bit code units in UTF-16.
dalle
@dalle: That makes a lot of sense. Do you know of a function that returns the number of bytes?
Mike
@Mike - APIs generally don't try to get clever when it comes to length and variable-width encodings. The number of chars does not necessarily equal the number of Unicode code points. (You'll see the same behaviour in C# and Java.) IMHO, UTF-16-over-UCS-2 support becomes more relevant in areas like font rendering and transcoding. Counting the code points would not necessarily be all that useful - a sequence of code points can combine to render a single grapheme. http://unicode.org/reports/tr29/ It is more useful to know how much storage the artefact requires.
McDowell
@McDowell: At this point in time, the only thing I care about is how many bytes are used to represent the string. I don't need to interpret the code points. I actually don't need to know the encoding just the start address of the string and its length in bytes.
Mike
@McDowell and @dalle: Sorry, I missed one of your points. BSTR uses two bytes per code unit; a code point can use more than one code unit, and a character can use more than one code point. Thank you both... that answers my question.
Mike
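
To make the code-unit/code-point distinction from these comments concrete: SysStringByteLen already gives the length in bytes, and counting code points just means skipping the second (low) half of each surrogate pair. A minimal sketch with a hypothetical helper, assuming the string is well-formed UTF-16:

#include <windows.h>
#include <oleauto.h>

// Hypothetical helper: count Unicode code points in a BSTR by
// skipping low (trailing) surrogates. Assumes well-formed UTF-16.
UINT CodePointCount(BSTR s) {
    UINT points = 0;
    const UINT units = SysStringLen(s);       // 16-bit code units
    for (UINT i = 0; i < units; ++i) {
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)   // not a low surrogate
            ++points;
    }
    return points;
}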
+1  A: 

BSTRs on Windows originally contained UCS-2, but can in principle contain the entire Unicode set, using surrogate pairs. UTF-16 support is actually up to the API that receives the string - the BSTR has no say in how it gets treated. Most APIs support UTF-16 by now. (Michael Kaplan sorts out the details.)

The Windows headers still contain an alternative definition of BSTR; it's basically:

#if defined(_WIN32) && !defined(OLE2ANSI)
   typedef wchar_t OLECHAR;   // 16-bit wide character (UTF-16 code unit)
#else
   typedef char OLECHAR;      // single-byte character (the legacy OLE2ANSI case)
#endif
typedef OLECHAR * BSTR;

There's no real reason to consider the char case, however, unless you desperately need to be compatible with whatever this was for. (IIRC it was active - or could be activated - for early MFC builds, and might even have been used in Office for Mac or something like that.)
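
As a sketch of the "UTF-16 support is up to the API" point above: WideCharToMultiByte with CP_UTF8, for instance, treats a surrogate pair in a BSTR as a single code point (error handling omitted):

#include <windows.h>
#include <oleauto.h>
#include <cstdio>

int main() {
    BSTR s = SysAllocString(L"\xD834\xDD1E");   // U+1D11E as a surrogate pair
    char utf8[8] = {0};
    // The receiving API decides how the pair is treated: CP_UTF8
    // converts it as one code point, yielding the four bytes F0 9D 84 9E.
    int n = WideCharToMultiByte(CP_UTF8, 0, s, SysStringLen(s),
                                utf8, sizeof(utf8), NULL, NULL);
    for (int i = 0; i < n; ++i)
        printf("%02X ", (unsigned char)utf8[i]);
    SysFreeString(s);
    return 0;
}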

peterchen
A: 

I am sure this will be very helpful if you are trying to understand character sets, encodings, etc.: encoding and char set

Pangea