I was confused about this for quite a while:

http://stackoverflow.com/questions/2384160/confusion-on-unicode-and-multibyte-articles

After reading the comments from all contributors, plus:

an old article (from 2001), http://www.hastingsresearch.com/net/04-unicode-limitations.shtml, which describes Unicode as:

being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to over 170,000 characters.

and a current "modern" article, http://en.wikipedia.org/wiki/Unicode:

The most commonly used encodings are UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).

It seems that in the VC2008 compilation options, the "Unicode" option under Character Set really means "Unicode encoded in UCS-2" (or UTF-16? I am not sure).

I tried to verify this by running the following code under VC2008:

#include <cstdio>    // std::getchar
#include <iostream>

int main()
{
    // Use unicode encoded in UCS-2?
    std::cout << sizeof(L"我爱你") << std::endl;
    // Use unicode encoded in UCS-2?
    std::cout << sizeof(L"abc") << std::endl;
    std::getchar();  // keep the console window open

    // Compiled using options Character Set : Use Unicode Character Set.
    // print out 8, 8

    // Compiled using options Character Set : Multi-byte Character Set.
    // print out 8, 8
}

It seems that when compiling with the Unicode Character Set option, the outcome matched my assumption.

But what about Multi-byte Character Set? What does Multi-byte Character Set mean in the current "modern" world? :)

A: 

Multi-byte means that a single character can take more than one byte to store.

Extract from Wikipedia on UTF-8:

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.

So essentially, UTF-8 is a multi-byte character set :-).

henchman
But if you read http://msdn.microsoft.com/en-us/library/ey142t48%28VS.71%29.aspx#_core_mfc_support_for_mbcs_strings, multi-byte characters require a "code page", whereas UTF-8 doesn't require a code page.
Yan Cheng CHEOK
from wp: "UTF-8 makes it easy for a program to identify the three sorts of units as they are kept apart. Older variable-width encodings are typically not so well designed, as in them the trail and lead units may use the same values, and in some all three sorts use overlapping values." my interpretation: Older character sets needed codepages, utf8 was designed not to need it. "Unicode has made code pages obsolete by supporting more languages and characters much more consistently"
henchman
So does this mean we cannot say UTF-8 = Microsoft's Multi-Byte Character Set? :)
Yan Cheng CHEOK
@YanCheng: UTF8 is an International standard, not a Microsoft standard.
John Saunders
On Windows, UTF-8 is codepage 65001.
MSalters
+3  A: 

http://en.wikipedia.org/wiki/Multi-byte_character_set

MBCS is a term used to denote a class of character encodings with characters that cannot be represented with a single byte, hence multi-byte character set. In order to properly decode a string in this format, you need a code page that tells you how the various byte combinations map to characters. ISO/IEC 8859, by contrast, defines a family of single-byte code pages; according to Wikipedia, ISO stopped maintaining them in 2004, presumably to focus on Unicode.

So I guess the modern term for MBCS is "deprecated in favor of Unicode".

MSN
+1. MBCS is a specific set of encodings, so does not mean the more general case of "using more than one byte per character". Unicode, UTF8, and UTF16 are not "MBCS", although they are encoded in multiple-bytes-per-character.
Jason Williams
A: 

Your program gets sizeof(wchar_t[4]) as strings are always character arrays in C; there is no variable-length encoding of Unicode without mbtowcs and relatives.

I read that MSVC uses 16-bit wide_chars, which is obsolescent. GNU uses 32-bit characters, which are necessary to support 21-bit unicode. The MSVC encoding is thus UCS-2, which corresponds to a C array, no variable-width encoding, and probably undefined behavior for out-of-bounds characters. GNU on the other hand would use UCS-4.

UTF-16, to be clear, is a variable-length encoding.

Potatoswatter
@Potatoswatter: "I read that MSVC uses 16-bit wide chars" - you heard wrong.
John Saunders
@John: then it would depend what version you're using. MSVC at least used 16-bit wchar_t much longer than GNU. Try Googling "MSVC wchar_t." I can't find any source saying that they are 32-bit in Windows, and http://en.wikipedia.org/wiki/Wide_character#Size_of_a_wide_character makes it sound like an entrenched API issue. I don't know how to use MSDN but the first hit on Google for wchar_t is http://msdn.microsoft.com/en-us/library/aa505945.aspx which defines it as "A 16-bit Unicode character" so I'd like a source for your assertion.
Potatoswatter
@Potatoswatter: when you say "MSVC uses" a 16-bit character, it reads like you mean that it can only use 16-bit characters.
John Saunders
@John: if you mean you parsed "bit wide" together as a phrase rather than "wide char", then I "fixed" it now.
Potatoswatter
@Potatoswatter: you fix; I fix.
John Saunders
"The MSVC encoding is thus UCS-2, which corresponds to a C array, no variable-width encoding..."MBCS does do variable encoding.
YeenFei
@YeenFei: I was answering his question "It seems that in the compilation options in VC2008, the options "Unicode" under Character Sets really means "Unicode encoded in UCS-2" (Or UTF-16? I am not sure)"; UCS-2 is constant-width and UTF-16 is variable width; I'll clarify the answer that I'm not referring to proprietary encodings.
Potatoswatter
MSVC encoding follows Win32, which is UTF-16 since NT4.
MSalters
@MSalters: According to http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx "In Visual C++, MBCS always means DBCS. Character sets wider than 2 bytes are not supported." This effectively makes it UCS-2, not UTF-16.
Potatoswatter
@Potatoswatter: read the entire page, including the "MBCS is used to describe all **non-Unicode** support". You can't derive conclusions about Unicode from the MBCS page.
MSalters
@MSalters: I already included an MSDN reference that `wchar_t` is defined as 16 bits. This is getting pretty old.
Potatoswatter
A: 

Multi Byte Character Set is a general term for any encoding scheme that can use more than 1 byte to encode a character.

When you hear the term you would normally expect it to be referring to one of the older legacy character sets as in "IBM EBCDIC cp1390 - Japanese Kanji Multi Byte".

All the UNICODE schemes are technically MBCSs but you would expect them to be referred to as "UNICODE" collectively or UTF-8, UTF-16, or UTF-32 specifically.

The only "current" software which uses an MBCS character set is the Microsoft Office suite, which uses the "Windows MBCS". This is almost identical to UTF-16 apart from some minor differences. Because Microsoft adopted the draft standard early, some small pieces of the complete standard proved difficult to implement, so it stuck with the term "Windows MBCS".

James Anderson
A: 

In MSVC, the "Unicode" option under Character Set means that _T("X") expands to L"X". If set to MBCS, _T("X") expands to just "X".

Another consequence is whether the Win32 macro MessageBox() expands to MessageBoxW() or MessageBoxA(), and likewise for the macros of all other Win32 functions that come in A/W pairs.

MSalters
but that says nothing about the encodings used
jalf
@jalf: True. That applies to the IDE setting, the compiler interpretation of strings and the A/W function choice. In all three cases the distinction is boolean, and the MBCS encoding unspecified.
MSalters