I'm in the process of trying to learn Unicode. For me the most difficult part is the encoding. Can BSTRs (Basic Strings) contain code points U+10000 or higher? If not, then what's the encoding for BSTRs?

+2  A: 

In Microsoft-speak, Unicode is generally synonymous with UTF-16 (little-endian, if memory serves). In the case of BSTR, the answer seems to be: it depends. According to the documentation, a BSTR:

  • On Microsoft Windows, consists of a string of Unicode characters (wide or double-byte characters).
  • On Apple Power Macintosh, consists of a single-byte string.
  • May contain multiple embedded null characters.

So, on Windows: yes, it can contain characters outside the Basic Multilingual Plane, but these require two 'wide' chars (a surrogate pair) to store.
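
For example, here's a minimal sketch (error handling omitted; link with oleaut32.lib) showing that a single supplementary-plane character, U+1D11E MUSICAL SYMBOL G CLEF, occupies two 16-bit code units in a BSTR:

#include <windows.h>
#include <oleauto.h>
#include <cstdio>

int main() {
    // U+1D11E lies outside the BMP; UTF-16 encodes it as the
    // surrogate pair D834 DD1E.
    BSTR s = SysAllocString(L"\xD834\xDD1E");
    printf("%u code units, %u bytes\n",
           SysStringLen(s),        // 2 -- one code point, two wide chars
           SysStringByteLen(s));   // 4
    SysFreeString(s);
    return 0;
}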

McDowell
I disassembled the system functions SysStringByteLen and SysStringLen. Both read the byte-length prefix, but SysStringLen divides it by 2. Doesn't this mean the system is using the UCS-2 encoding?
Mike
@Mike: I think the documentation for `SysStringLen` is misleading: it returns the number of 16-bit **code units** in the string, not characters. Characters with code points U+10000 and higher use two 16-bit code units in UTF-16.
dalle
@dalle: That makes a lot of sense. Do you know of a function that returns the number of bytes?
Mike
@Mike - APIs generally don't try to get clever when it comes to length and variable-width encodings. The number of chars does not necessarily equal the number of Unicode code points. (You'll see the same behaviour in C# and Java.) IMHO, UTF-16-over-UCS-2 support becomes more relevant in areas like font rendering and transcoding. Counting the code points would not necessarily be all that useful - a sequence of code points can combine to render a single grapheme. http://unicode.org/reports/tr29/ It is more useful to know how much storage the artefact requires.
McDowell
@McDowell: At this point in time, the only thing I care about is how many bytes are used to represent the string. I don't need to interpret the code points. I actually don't need to know the encoding just the start address of the string and its length in bytes.
Mike
@McDowell and @dalle: Sorry, I missed one of your points. BSTR uses two bytes per code unit; a code point can use more than one code unit, and a character can use more than one code point. Thank you both... that answers my question.
Mike
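
To make the code-unit/code-point distinction from these comments concrete: SysStringByteLen already gives the length in bytes, and counting code points just means skipping the second (low) half of each surrogate pair. A minimal sketch with a hypothetical helper, assuming the string is well-formed UTF-16:

#include <windows.h>
#include <oleauto.h>

// Hypothetical helper: count Unicode code points in a BSTR by
// skipping low (trailing) surrogates. Assumes well-formed UTF-16.
UINT CodePointCount(BSTR s) {
    UINT points = 0;
    const UINT units = SysStringLen(s);       // 16-bit code units
    for (UINT i = 0; i < units; ++i) {
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)   // not a low surrogate
            ++points;
    }
    return points;
}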
+1  A: 

BSTRs on Windows originally contained UCS-2, but can in principle contain the entire Unicode set, using surrogate pairs. UTF-16 support is actually up to the API that receives the string - the BSTR has no say in how it gets treated. Most APIs support UTF-16 by now. (Michael Kaplan sorts out the details.)

The Windows headers still contain an alternative definition of BSTR; it's basically:

#if defined(_WIN32) && !defined(OLE2ANSI)
   typedef wchar_t OLECHAR;   // 16-bit wide character (UTF-16 code unit)
#else
   typedef char OLECHAR;      // single-byte character (the legacy OLE2ANSI case)
#endif
typedef OLECHAR * BSTR;

There's no real reason to consider the char case, however, unless you desperately need to be compatible with whatever this was for. (IIRC it was active - or could be activated - for early MFC builds, and might even have been used in Office for Mac or something like that.)
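
As a sketch of the "UTF-16 support is up to the API" point above: WideCharToMultiByte with CP_UTF8, for instance, treats a surrogate pair in a BSTR as a single code point (error handling omitted):

#include <windows.h>
#include <oleauto.h>
#include <cstdio>

int main() {
    BSTR s = SysAllocString(L"\xD834\xDD1E");   // U+1D11E as a surrogate pair
    char utf8[8] = {0};
    // The receiving API decides how the pair is treated: CP_UTF8
    // converts it as one code point, yielding the four bytes F0 9D 84 9E.
    int n = WideCharToMultiByte(CP_UTF8, 0, s, SysStringLen(s),
                                utf8, sizeof(utf8), NULL, NULL);
    for (int i = 0; i < n; ++i)
        printf("%02X ", (unsigned char)utf8[i]);
    SysFreeString(s);
    return 0;
}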

peterchen
A: 

I am sure this will be very helpful if you are trying to understand character sets, encodings, etc.: encoding and char set

Pangea