tags:

views: 537

answers: 7

Is the wchar_t type required for Unicode support? If not, then what's the point of this multibyte type? Why would you use wchar_t when you could accomplish the same thing with char?

A: 

char is generally a single byte. (sizeof(char) must be equal to 1).

wchar_t was added to the language specifically to support multibyte characters.
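
A quick way to see the difference (a hypothetical snippet, not part of the original answer; the wchar_t size printed varies by platform):

```cpp
#include <cstdio>

int main() {
    std::printf("sizeof(char)    = %zu\n", sizeof(char));     // always 1, by definition
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));  // typically 2 on Windows, 4 on Linux
    return 0;
}
```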

James Curran
The C and C++ definitions of "byte" are "amount of memory taken by a single char". No need for weasel words like "generally" here. It might not be an _octet_ (8 bits), though.
MSalters
+3  A: 

wchar_t is not required. It's not even guaranteed to have a specific encoding. The point is to provide a data type that represents the wide characters native to your system, similar to char representing native characters. On Windows, for example, you can use wchar_t to access the wide character Win32 API functions.
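
For instance (a minimal, hypothetical sketch assuming a Windows target with the Win32 headers available):

```cpp
#include <windows.h>

int main() {
    // Wide string literals (L"...") are arrays of wchar_t and pair
    // directly with the "W" variants of the Win32 API.
    MessageBoxW(nullptr, L"Hello, wide world", L"wchar_t demo", MB_OK);
    return 0;
}
```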

Malte Clasen
+3  A: 

Because you can't accomplish the same thing with char:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

gatorfax
As the title of that post says, this is something every developer absolutely, positively must know about Unicode. For that reason alone, I wish I could give more than a single upvote. :)
Epcylon
The reference is good but the statement is actually quite false. Using UTF-8 to map Unicode to legacy `char` is not only possible but quite likely the single most common encoding.
DigitalRoss
zdawg is right. You do not need wchar_t to implement Unicode properly, and using it will not necessarily even help. For one thing, wchar_t can be as small as 8 bits. On Windows it is 16, which means you can represent a UTF-16 code unit, but *not all characters*. That's why the Unicode standard says "Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text." You can use char, as long as you treat char as meaning "byte."
Matthew Flaschen
-1 wchar_t is one of the worst things ever invented, because wchar_t may be 2 or 4 bytes depending on the platform.
Artyom
+3  A: 

wchar_t is absolutely NOT required for Unicode. UTF-8, for example, maintains backward compatibility with ASCII and uses plain 8-bit char. wchar_t mostly yields support for so-called multi-byte characters, or basically any character set that's encoded using more than sizeof(char).
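
For instance (a hypothetical snippet, not part of the original answer): ASCII bytes pass through UTF-8 unchanged, while other code points span multiple char bytes:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // "Aé" in UTF-8: 'A' is one byte (same as ASCII), 'é' (U+00E9) is two
    const char* s = "A\xC3\xA9";
    std::printf("bytes: %zu\n", std::strlen(s));  // prints 3 for 2 characters
    return 0;
}
```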

zdawg
It sounds like you are implying that UTF-8 encodes all characters as 8 bits, which is not only untrue, but if true would be quite a feat of data compression. UTF-8 *is* a multi-byte encoding: some characters are encoded using 8 bits, some using 16 bits, some using 24 bits, and some using 32 bits. It can support (though it's not currently needed, I think) characters encoded using up to 48 bits.
Dan Moulding
+1  A: 

Be careful: wchar_t is often 16 bits, which is not enough to store all Unicode characters, and it is a bad choice for data in UTF-8, for instance.
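
To illustrate (a hypothetical sketch; the counts in the comments assume a 16-bit wchar_t on Windows and a 32-bit one on Linux):

```cpp
#include <cstdio>

int main() {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane
    const wchar_t* clef = L"\U0001D11E";
    std::size_t units = 0;
    while (clef[units] != L'\0') ++units;
    // 2 units with a 16-bit wchar_t (a UTF-16 surrogate pair),
    // 1 unit where wchar_t is 32 bits
    std::printf("wchar_t units: %zu\n", units);
    return 0;
}
```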

Martin Beckett
This is not true on Linux (or, I assume, other Unix-ish systems), where it's 32 bits. It depends on the compiler and runtime.
greyfade
+9  A: 

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So, you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of legacy char objects, and it would even still be null-terminated.

To answer your "then what is the point of wchars?" question...

The problem with UTF-8 is that s[i] is not necessarily a character any more; it might be just a piece of one, whereas with wider characters you can mostly preserve the abstraction that s[i] is a single character. (Though there are more than 2^16 code points, actually.)
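
A hedged illustration of that indexing point (hypothetical snippet; assumes wchar_t can hold U+00E9 directly, which holds on common platforms):

```cpp
#include <cstdio>

int main() {
    const char*    s  = "\xC3\xA9tude";  // "étude" in UTF-8: 6 bytes, 5 characters
    const wchar_t* ws = L"\u00E9tude";   // the same text as wide characters

    // s[0] is only the first byte of the two-byte sequence for 'é'
    std::printf("s[0]  = 0x%02X (a fragment)\n", (unsigned)(unsigned char)s[0]);  // 0xC3
    // ws[0] is the whole code point
    std::printf("ws[0] = U+%04X (a character)\n", (unsigned)ws[0]);               // 0x00E9
    return 0;
}
```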

DigitalRoss
Unicode only recently (4.0?) added more than 65536 code points. A conforming C++ implementation therefore has to choose: support only Unicode 3.x and a 16-bit `wchar_t`, or use a 32-bit `wchar_t`. Using UTF-16 is technically non-conforming, as there's no such thing as a "null-terminated multi-wchar_t" encoding.
MSalters
Characters outside the BMP were first assigned in Unicode 3.1, in 2001.
dan04
A: 

You absolutely do not need wchar_t to support Unicode in software; in fact, using wchar_t makes it even harder, because you do not know whether a "wide string" is UTF-16 or UTF-32 -- it depends on the OS: under Windows it is UTF-16, on all others UTF-32.

However, UTF-8 allows you to write Unicode-enabled software easily. (*)

See: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

(*) Note: under Windows you still have to use wchar_t, because Windows does not support UTF-8 locales, so for Unicode-enabled Windows programming you have to use the wchar_t-based API.
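
A sketch of that workaround (hypothetical code for a Windows target; utf8_to_wide is an illustrative helper, not a standard function): keep text as UTF-8 in char internally, and widen it only at the Win32 API boundary:

```cpp
#include <windows.h>
#include <vector>

// Convert a null-terminated UTF-8 string to UTF-16 wchar_t
// using the Win32 MultiByteToWideChar function.
std::vector<wchar_t> utf8_to_wide(const char* utf8) {
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    std::vector<wchar_t> wide(len);
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide.data(), len);
    return wide;
}

int main() {
    // "Étude" is stored internally as UTF-8 and widened only for the API call
    std::vector<wchar_t> title = utf8_to_wide("\xC3\x89tude");
    MessageBoxW(nullptr, title.data(), L"UTF-8 internally", MB_OK);
    return 0;
}
```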

Artyom