I am confused about Windows BSTRs and WCHARs, etc. WCHAR is a 16-bit character type intended to allow for Unicode characters. What about characters that take more than 16 bits to represent? Some UTF-8 characters require more than that. Is this a limitation of Windows?

Edit: Thanks for all the answers. I think I understand the Unicode aspect. I am still confused on the Windows/WCHAR aspect, though. If WCHAR is a 16-bit char, does Windows really use two of them to represent code points bigger than 16 bits, or is the data truncated?

+2  A: 

UTF-8 is an encoding of Unicode code points. You may want to read this excellent FAQ on the subject. To answer your question, though: BSTRs are always encoded as UTF-16. If you have UTF-32 encoded strings, you will have to transcode them first.
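
As a rough sketch of that transcoding step (shown for UTF-8 input, since the question mentions it): MultiByteToWideChar, SysAllocStringLen and SysFreeString are real Win32/OLE calls, but the wrapper below and its name are only an illustration, not the one canonical way to do it.

    // Sketch: build a BSTR (always UTF-16) from a UTF-8 string.
    #include <windows.h>
    #include <oleauto.h>

    BSTR BstrFromUtf8(const char* utf8)   // hypothetical helper name
    {
        // First call: ask how many UTF-16 code units the result needs
        // (-1 means "the input is NUL-terminated"; the count includes the NUL).
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
        if (len <= 0)
            return nullptr;

        // Allocate len - 1 characters; SysAllocStringLen adds the terminator.
        BSTR result = SysAllocStringLen(nullptr, len - 1);
        if (!result)
            return nullptr;

        // Second call: perform the actual UTF-8 -> UTF-16 conversion.
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, result, len);
        return result;   // caller releases it with SysFreeString
    }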

Jeff Paquette
+1  A: 

The Unicode standard defines just over a million possible code points - 1,114,112, from U+0000 through U+10FFFF (each code point represents an 'abstract' character or symbol - e.g. 'E', '=' or '~').

The standard also defines several methods of encoding those million-plus code points into commonly used fundamental data types, such as 8-bit chars or 16-bit wchars.

The two most widely used encodings are UTF-8 and UTF-16. UTF-8 defines how to encode Unicode code points into 8-bit chars. Each Unicode code point maps to between 1 and 4 8-bit chars.

UTF-16 defines how to encode Unicode code points into 16-bit words (WCHAR in Windows). Most code points map onto a single 16-bit WCHAR, but some require two WCHARs (a surrogate pair) to represent.

I recommend taking a look at the Unicode standard, and especially the FAQ (http://unicode.org/faq/utf_bom.html).
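
To make the two-WCHAR case concrete, here is a small sketch of the surrogate-pair arithmetic UTF-16 uses for code points above U+FFFF (plain C++, no Windows dependency; the function name is invented for this example):

    #include <cstdint>
    #include <cstdio>

    // Encode one code point as UTF-16 code units (WCHARs on Windows).
    // U+0000..U+FFFF take one unit; U+10000..U+10FFFF take a surrogate pair.
    int EncodeUtf16(uint32_t codePoint, uint16_t out[2])
    {
        if (codePoint <= 0xFFFF) {
            out[0] = static_cast<uint16_t>(codePoint);
            return 1;
        }
        codePoint -= 0x10000;                                          // now a 20-bit value
        out[0] = static_cast<uint16_t>(0xD800 + (codePoint >> 10));    // high surrogate
        out[1] = static_cast<uint16_t>(0xDC00 + (codePoint & 0x3FF));  // low surrogate
        return 2;
    }

    int main()
    {
        uint16_t units[2];
        int n = EncodeUtf16(0x1F600, units);   // U+1F600 is outside the 16-bit range
        std::printf("%d units: %04X %04X\n", n, units[0], units[1]);
        // Prints: 2 units: D83D DE00
    }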

sdtom
UCS-2 is not UTF-16. There is a big difference between the 2. UCS-2 only allows for a single WCHAR per character. UTF-16 allows characters to be represented as either 1 or 2 WCHARs.
Jon Benedicto
Yep - removed the ucs-2 reference
sdtom
+1  A: 

As others have mentioned, the FAQ has a lot of great information on unicode.

The short answer to your question, however, is that a single Unicode character may require more than one 16-bit character to represent it. This is also how UTF-8 works; any Unicode character that falls outside the range a single byte can represent uses two (or more) bytes.
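
A quick way to see those byte counts (a sketch using standard C++ string literals; the specific characters are just examples):

    #include <cstdio>

    int main()
    {
        // One code point each, but different UTF-8 lengths.
        // sizeof includes the terminating NUL, hence the "- 1".
        std::printf("%zu\n", sizeof(u8"A") - 1);            // 1 byte  (U+0041)
        std::printf("%zu\n", sizeof(u8"\u00E9") - 1);       // 2 bytes (U+00E9, e-acute)
        std::printf("%zu\n", sizeof(u8"\u20AC") - 1);       // 3 bytes (U+20AC, euro sign)
        std::printf("%zu\n", sizeof(u8"\U0001F600") - 1);   // 4 bytes (U+1F600)
    }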

ShZ
+5  A: 

UTF-8 is not the encoding used in Windows' BSTR or WCHAR types. Instead, they use UTF-16, which encodes each code point in the Unicode set as either 1 or 2 WCHARs. A pair of WCHARs covers exactly the same code points (U+10000 through U+10FFFF) as the 4-byte UTF-8 sequences do.

So there is no limitation in Windows character set handling.
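
One way to see that equivalence (a sketch using standard C++ literals rather than Windows-specific types; char16_t plays the role of WCHAR here):

    #include <cstdio>

    int main()
    {
        // U+1F600 needs the 4-byte form in UTF-8 and a surrogate pair in UTF-16.
        // sizeof includes the terminating NUL, hence the "- 1" corrections.
        std::printf("UTF-8:  %zu bytes\n", sizeof(u8"\U0001F600") - 1);                        // 4
        std::printf("UTF-16: %zu code units\n", sizeof(u"\U0001F600") / sizeof(char16_t) - 1); // 2
    }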

Jon Benedicto
I seem to recall that Windows used to consider Unicode strings to be UCS-2. Is that true? When did it change?
Rob Kennedy
Windows 2000 introduced UTF-16, IIRC.
Jon Benedicto
+1  A: 

BSTR simply contains 16-bit code units that can hold any UTF-16 encoded data. As for the OS, Windows has supported surrogate pairs since XP. See the Dr. International FAQ.
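
As an illustration of treating a BSTR as UTF-16 code units (SysStringLen is a real OLE call; the counting helper below is just a sketch and skips validation of unpaired surrogates):

    #include <windows.h>
    #include <oleauto.h>

    // Count code points in a BSTR, treating each surrogate pair as one.
    UINT CountCodePoints(BSTR s)
    {
        UINT units = SysStringLen(s);   // length in 16-bit code units, not characters
        UINT codePoints = 0;
        for (UINT i = 0; i < units; ++i) {
            ++codePoints;
            // A high surrogate followed by a low surrogate is one supplementary code point.
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < units && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i;   // skip the low surrogate
            }
        }
        return codePoints;
    }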

Nemanja Trifunovic
A: 

Windows has used UTF-16 as its native representation since Windows 2000; prior to that it used UCS-2. UTF-16 supports any Unicode character; UCS-2 only supports the BMP (Basic Multilingual Plane). In other words, Windows will do the right thing.

In general, though, it doesn't matter much. For most applications, strings are pretty opaque and are just passed to some I/O mechanism (for storage in a file or database, or display on-screen, etc.) that will do the Right Thing. You just need to ensure you don't damage the strings along the way.

DrPizza