ansaurus

Question

Answer 1

+1 A:

Wide character strings are sequences of wchar_t elements, each usually wider than one byte, whereas the normal C string is a char*, a sequence of byte-wide characters. wchar_t is not the same thing as Unicode on all platforms, though Unicode representations are often based on wchar_t.

I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.

Typical usage would be converting a 2-byte based string to a regular C string, and vice versa.

Sam Post 2010-02-03 07:05:12

Answer 2

+1 A:

According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the standard only requires the range from WCHAR_MIN to WCHAR_MAX to cover at least [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed.

A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, the characters need not all occupy the same number of bytes (UTF-8 is an example). An object of type wchar_t, by contrast, has a fixed size (in a given implementation, of course).

As an aside, I can also find the following in my copy of the C99 draft:

__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.

So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.

Alok 2010-02-03 07:18:24

Answer 3

+3 A:

wcstombs() converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS2 on Windows, but is usually UCS4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used, but note that there are other possibilities, like JIS.

caf 2010-02-03 07:18:59

On Windows that is UTF-16, not UCS2.

Mihai Nita 2010-02-03 08:48:02

Fair enough. (That seems somewhat broken - the whole point of widechars was supposed to be that one widechar is always exactly one character).

caf 2010-02-03 22:10:24

Answer 4

+1 A:

You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.

For example, with MSVC you might use

setlocale( LC_ALL, ".1252" );

to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character sets:

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.

The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.

Michael Burr 2010-02-03 07:48:45

Warning: there is no standard for the locale string in setlocale, so it is not easy to do anything cross-platform. For instance, .1252 is valid on Windows, but not on UNIX/Linux (there you will see stuff like en_US.UTF-8 or en_US.ISO-8859-1).

Mihai Nita 2010-02-03 08:50:13

wcstombs: character encoding?
