tags:

views: 537

answers: 7

Is the wchar_t type required for Unicode support? If not, then what's the point of this multibyte type? Why would you use wchar_t when you could accomplish the same thing with char?

A: 

char is generally a single byte. (sizeof(char) must be equal to 1).

wchar_t was added to the language specifically to support multibyte characters.
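
A quick way to see the difference (a hypothetical snippet, not part of the original answer; the wchar_t size printed varies by platform):

```cpp
#include <cstdio>

int main() {
    std::printf("sizeof(char)    = %zu\n", sizeof(char));     // always 1, by definition
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));  // typically 2 on Windows, 4 on Linux
    return 0;
}
```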

James Curran
The C and C++ definitions of "byte" are "amount of memory taken by a single char". No need for weasel words like "generally" here. It might not be an _octet_ (8 bits), though.
MSalters
+3  A: 

wchar_t is not required. It's not even guaranteed to have a specific encoding. The point is to provide a data type that represents the wide characters native to your system, similar to char representing native characters. On Windows, for example, you can use wchar_t to access the wide character Win32 API functions.
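
For instance (a minimal, hypothetical sketch assuming a Windows target with the Win32 headers available):

```cpp
#include <windows.h>

int main() {
    // Wide string literals (L"...") are arrays of wchar_t and pair
    // directly with the "W" variants of the Win32 API.
    MessageBoxW(nullptr, L"Hello, wide world", L"wchar_t demo", MB_OK);
    return 0;
}
```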

Malte Clasen
+3  A: 

Because you can't accomplish the same thing with char:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

gatorfax
As the title of that post says, this is something every developer absolutely, positively must know about Unicode. For that reason alone, I wish I could give more than a single upvote. :)
Epcylon
The reference is good but the statement is actually quite false. Using UTF-8 to map Unicode to legacy `char` is not only possible but quite likely the single most common encoding.
DigitalRoss
zdawg is right. You do not need wchar_t to implement Unicode properly, and using it will not necessarily even help. For one thing, wchar_t can be as small as 8 bits. On Windows it is 16, which means you can represent a UTF-16 code unit, but *not all characters*. That's why the Unicode standard says "Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text." You can use char, as long as you treat char as meaning "byte."
Matthew Flaschen
-1 wchar_t is one of the worst things ever invented, because wchar_t may be 2 or 4 bytes depending on the platform.
Artyom
+3  A: 

wchar_t is absolutely NOT required for Unicode. UTF-8, for example, maintains backward compatibility with ASCII and uses plain 8-bit char. wchar_t mostly yields support for so-called multi-byte characters, or basically any character set that's encoded using more than sizeof(char).
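
For instance (a hypothetical snippet, not part of the original answer): ASCII bytes pass through UTF-8 unchanged, while other code points span multiple char bytes:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // "Aé" in UTF-8: 'A' is one byte (same as ASCII), 'é' (U+00E9) is two
    const char* s = "A\xC3\xA9";
    std::printf("bytes: %zu\n", std::strlen(s));  // prints 3 for 2 characters
    return 0;
}
```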

zdawg
It sounds like you are implying that UTF-8 encodes all characters as 8 bits, which is not only untrue, but if true would be quite a feat of data compression. UTF-8 *is* a multi-byte encoding: some characters are encoded using 8 bits, some using 16 bits, some using 24 bits, and some using 32 bits. It can support (though it's not currently needed, I think) characters encoded using up to 48 bits.
Dan Moulding
+1  A: 

Be careful: wchar_t is often 16 bits, which is not enough to store all Unicode characters, and it is a bad choice for data in UTF-8, for instance.
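
To illustrate (a hypothetical sketch; the counts in the comments assume a 16-bit wchar_t on Windows and a 32-bit one on Linux):

```cpp
#include <cstdio>

int main() {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane
    const wchar_t* clef = L"\U0001D11E";
    std::size_t units = 0;
    while (clef[units] != L'\0') ++units;
    // 2 units with a 16-bit wchar_t (a UTF-16 surrogate pair),
    // 1 unit where wchar_t is 32 bits
    std::printf("wchar_t units: %zu\n", units);
    return 0;
}
```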

Martin Beckett
This is not true on Linux (or, I assume, other Unix-ish systems), where it's 32 bits. It depends on the compiler and runtime.
greyfade
+9  A: 

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So, you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of legacy char objects, and it would even still be null-terminated.

To answer your "then what is the point of wchars?" question...

The problem with UTF-8 is that s[i] is not necessarily a character any more; it might be just a piece of one, whereas with wider characters you can mostly preserve the abstraction that s[i] is a single character. (Though there are more than 2^16 code points, actually.)
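
A hedged illustration of that indexing point (hypothetical snippet; assumes wchar_t can hold U+00E9 directly, which holds on common platforms):

```cpp
#include <cstdio>

int main() {
    const char*    s  = "\xC3\xA9tude";  // "étude" in UTF-8: 6 bytes, 5 characters
    const wchar_t* ws = L"\u00E9tude";   // the same text as wide characters

    // s[0] is only the first byte of the two-byte sequence for 'é'
    std::printf("s[0]  = 0x%02X (a fragment)\n", (unsigned)(unsigned char)s[0]);  // 0xC3
    // ws[0] is the whole code point
    std::printf("ws[0] = U+%04X (a character)\n", (unsigned)ws[0]);               // 0x00E9
    return 0;
}
```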

DigitalRoss
Unicode only recently (4.0?) added more than 65536 code points. A conforming C++ implementation therefore has to choose: support only Unicode 3.x and a 16-bit `wchar_t`, or use a 32-bit `wchar_t`. Using UTF-16 is technically non-conforming, as there's no such thing as a "null-terminated multi-wchar_t" encoding.
MSalters
Characters outside the BMP were first assigned in Unicode 3.1, in 2001.
dan04
A: 

You absolutely do not need wchar_t to support Unicode in software; in fact, using wchar_t makes it even harder, because you do not know whether a "wide string" is UTF-16 or UTF-32 -- it depends on the OS: under Windows it is UTF-16, on all others UTF-32.

However, UTF-8 allows you to write Unicode-enabled software easily. (*)

See: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

(*) Note: under Windows you still have to use wchar_t, because Windows does not support UTF-8 locales, so for Unicode-enabled Windows programming you have to use the wchar_t-based API.
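
A sketch of that workaround (hypothetical code for a Windows target; utf8_to_wide is an illustrative helper, not a standard function): keep text as UTF-8 in char internally, and widen it only at the Win32 API boundary:

```cpp
#include <windows.h>
#include <vector>

// Convert a null-terminated UTF-8 string to UTF-16 wchar_t
// using the Win32 MultiByteToWideChar function.
std::vector<wchar_t> utf8_to_wide(const char* utf8) {
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    std::vector<wchar_t> wide(len);
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide.data(), len);
    return wide;
}

int main() {
    // "Étude" is stored internally as UTF-8 and widened only for the API call
    std::vector<wchar_t> title = utf8_to_wide("\xC3\x89tude");
    MessageBoxW(nullptr, title.data(), L"UTF-8 internally", MB_OK);
    return 0;
}
```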

Artyom