ansaurus

Question

What is the current modern term for "Multi-byte Character Set"?

Answer 1

A:

multi-byte means that a character may be stored in more than one byte.

extract from wikipedia on utf8:

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.

so essentially, utf8 is a multi-byte character set :-).
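A minimal sketch of the 1-to-4-octet claim, using Python's built-in UTF-8 codec. The four sample characters are picked for illustration and are not from the quoted article.

```python
# Sample characters chosen for illustration (an assumption, not from the source).
samples = {
    "A": "\u0041",           # US-ASCII letter
    "e-acute": "\u00e9",     # Latin-1 range
    "euro sign": "\u20ac",   # elsewhere in the Basic Multilingual Plane
    "emoji": "\U0001F600",   # outside the Basic Multilingual Plane
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in utf8_lengths.items():
    print(f"{name}: {n} byte(s)")
```

Only the ASCII letter fits in a single byte; the others need two, three and four bytes respectively.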

henchman 2010-03-10 03:15:05

But if you read http://msdn.microsoft.com/en-us/library/ey142t48%28VS.71%29.aspx#_core_mfc_support_for_mbcs_strings, multi-byte characters require a "code page", whereas UTF-8 doesn't require a code page.

Yan Cheng CHEOK 2010-03-10 03:17:33

From Wikipedia: "UTF-8 makes it easy for a program to identify the three sorts of units as they are kept apart. Older variable-width encodings are typically not so well designed, as in them the trail and lead units may use the same values, and in some all three sorts use overlapping values." My interpretation: older character sets needed code pages, and UTF-8 was designed not to need them. "Unicode has made code pages obsolete by supporting more languages and characters much more consistently"
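To illustrate the quoted point about lead and trail units, here is a rough sketch. The helper classify_utf8_byte is a hypothetical name, and the classification is simplified (it does not reject byte values that are never valid in UTF-8). It is contrasted with code page 932 (Shift-JIS), picked here as one example of an older variable-width encoding.

```python
def classify_utf8_byte(b: int) -> str:
    """Simplified classification of a UTF-8 byte by its high bits."""
    if b < 0x80:
        return "single"   # 0xxxxxxx: a complete ASCII character
    if b < 0xC0:
        return "trail"    # 10xxxxxx: continuation byte
    return "lead"         # 11xxxxxx: first byte of a multi-byte sequence

katakana_so = "\u30bd"    # one katakana character, chosen for illustration

# In UTF-8 the three bytes fall into clearly separate ranges.
utf8_kinds = [classify_utf8_byte(b) for b in katakana_so.encode("utf-8")]

# In code page 932 the second (trail) byte is 0x5C, the same value as an
# ASCII backslash, so a byte-wise search for "\" finds a false match.
cp932_bytes = katakana_so.encode("cp932")
false_match = b"\\" in cp932_bytes

print(utf8_kinds, cp932_bytes, false_match)
```

This kind of overlap is why naive byte-oriented string code tends to break on legacy multi-byte text but is comparatively safe on UTF-8.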

henchman 2010-03-10 03:19:38

So does this mean we cannot say UTF-8 = Microsoft's Multi-Byte Character Set? :)

Yan Cheng CHEOK 2010-03-10 03:23:02

@YanCheng: UTF8 is an International standard, not a Microsoft standard.

John Saunders 2010-03-10 03:28:49

On Windows, UTF-8 is codepage 65001.

MSalters 2010-03-11 10:01:47

Answer 2

+3 A:

http://en.wikipedia.org/wiki/Multi-byte_character_set

MBCS is a term used to denote a class of character encodings in which some characters cannot be represented with a single byte, hence multi-byte character set. To decode a string in this format properly, you need a code page that tells you how the various byte combinations map to characters. ISO/IEC 8859 defines a set of related legacy code-page standards (strictly speaking, these are single-byte 8-bit encodings rather than MBCS), and according to Wikipedia, ISO stopped maintaining them in 2004, presumably to focus on Unicode.
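As a small sketch of why the code page matters: the same byte string decodes to different text depending on which code page is assumed. The byte values and the two code pages (932 for Japanese, 1252 for Western European) are chosen for illustration.

```python
# Six raw bytes with no encoding information attached (illustrative values).
raw = b"\x83\x65\x83\x58\x83\x67"

# Interpreted as code page 932 (Shift-JIS), each pair of bytes is one character.
as_cp932 = raw.decode("cp932")

# Interpreted as code page 1252, every byte is its own character.
as_cp1252 = raw.decode("cp1252")

print(len(as_cp932), len(as_cp1252))
```

Without knowing the intended code page, there is no way to tell from the bytes alone which reading is correct.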

So I guess the modern term for MBCS is "deprecated in favor of Unicode".

MSN 2010-03-10 07:02:01

+1. MBCS is a specific set of encodings, so does not mean the more general case of "using more than one byte per character". Unicode, UTF8, and UTF16 are not "MBCS", although they are encoded in multiple-bytes-per-character.

Jason Williams 2010-03-10 21:06:04

Answer 3

A:

Your program gets sizeof(wchar_t[4]) as strings are always character arrays in C; there is no variable-length encoding of Unicode without mbtowcs and relatives.

I read that MSVC uses 16-bit wide_chars, which is obsolescent. GNU uses 32-bit characters, which are necessary to support 21-bit unicode. The MSVC encoding is thus UCS-2, which corresponds to a C array, no variable-width encoding, and probably undefined behavior for out-of-bounds characters. GNU on the other hand would use UCS-4.
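One way to check the wide-character size on a given platform without writing C is Python's ctypes module, whose c_wchar type mirrors the platform's wchar_t. This is only a quick probe, not a statement about any particular compiler version.

```python
import ctypes

# Size in bytes of the platform's wchar_t as seen by ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {wchar_bytes * 8} bits on this platform")
```

It typically reports 16 bits on Windows and 32 bits on Linux.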

UTF-16, to be clear, is a variable-length encoding.
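A short sketch of why UTF-16 is variable-length: code points above U+FFFF are split into a surrogate pair of two 16-bit units. The helper name utf16_code_units is hypothetical, and the function does not validate its input (for example, it does not reject code points in the surrogate range).

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units that encode one code point in UTF-16."""
    if code_point < 0x10000:
        return [code_point]                  # fits in a single 16-bit unit
    offset = code_point - 0x10000            # 20 bits remain to be split
    high = 0xD800 + (offset >> 10)           # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits -> low surrogate
    return [high, low]

bmp_units = utf16_code_units(0x0041)         # 'A'
astral_units = utf16_code_units(0x1F600)     # an emoji outside the BMP

print([hex(u) for u in bmp_units], [hex(u) for u in astral_units])
```

A fixed-width 16-bit scheme such as UCS-2 has no such pairs, so it cannot represent the second character at all.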

Potatoswatter 2010-03-10 08:04:09

@Potatoswatter: "I read that MSVC uses 16-bit wide chars" - you heard wrong.

John Saunders 2010-03-10 19:56:50

@John: then it would depend what version you're using. MSVC at least used 16-bit wchar_t much longer than GNU. Try Googling "MSVC wchar_t." I can't find any source saying that they are 32-bit in Windows, and http://en.wikipedia.org/wiki/Wide_character#Size_of_a_wide_character makes it sound like an entrenched API issue. I don't know how to use MSDN but the first hit on Google for wchar_t is http://msdn.microsoft.com/en-us/library/aa505945.aspx which defines it as "A 16-bit Unicode character" so I'd like a source for your assertion.

Potatoswatter 2010-03-10 20:05:54

@Potatoswatter: when you say "MSVC uses" a 16-bit character, it reads like you mean that it can only use 16-bit characters.

John Saunders 2010-03-10 20:20:39

@John: if you mean you parsed "bit wide" together as a phrase rather than "wide char", then I "fixed" it now.

Potatoswatter 2010-03-10 20:31:29

@Potatoswatter: you fix; I fix.

John Saunders 2010-03-10 20:51:30

"The MSVC encoding is thus UCS-2, which corresponds to a C array, no variable-width encoding..." MBCS does do variable encoding.

YeenFei 2010-03-11 00:41:27

@YeenFei: I was answering his question "It seems that in the compilation options in VC2008, the options "Unicode" under Character Sets really means "Unicode encoded in UCS-2" (Or UTF-16? I am not sure)"; UCS-2 is constant-width and UTF-16 is variable width; I'll clarify the answer that I'm not referring to proprietary encodings.

Potatoswatter 2010-03-11 03:25:52

MSVC encoding follows Win32, which is UTF-16 since NT4.

MSalters 2010-03-11 10:00:22

@MSalters: According to http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx "In Visual C++, MBCS always means DBCS. Character sets wider than 2 bytes are not supported." This effectively makes it UCS-2, not UTF-16.

Potatoswatter 2010-03-12 07:22:09

@Potatoswatter: read the entire page, including the "MBCS is used to describe all **non-Unicode** support". You can't derive conclusions about Unicode from the MBCS page.

MSalters 2010-03-22 12:35:11

@MSalters: I already included an MSDN reference that `wchar_t` is defined as 16 bits. This is getting pretty old.

Potatoswatter 2010-03-22 17:43:01

Answer 4

A:

Multi Byte Character Set is a general term for any encoding scheme that can use more than 1 byte to encode a character.

When you hear the term you would normally expect it to be referring to one of the older legacy character sets, as in "IBM EBCDIC cp1390 - Japanese Kanji Multi Byte".

All the UNICODE schemes are technically MBCSs, but you would expect them to be referred to as "UNICODE" collectively or as utf-8, utf-16, or utf-32 specifically.

The only "current" software which uses an MBCS character set is the Microsoft Office suite, which uses the "Windows MBCS". This is almost identical to utf-16 apart from some minor differences. Because Microsoft adopted the draft standard early, some small pieces of the complete standard proved difficult to implement, so it stuck with the term "Windows MBCS".

James Anderson 2010-03-10 08:37:52

Answer 5

A:

In MSVC, the option "Unicode" under Character Sets means that _T("X") expands to L"X". If set to MBCS, _T("X") expands to just "X".

Another consequence is whether the Win32 macro MessageBox() expands to MessageBoxW() or MessageBoxA(), and likewise for the macros of all other Win32 functions that come in A/W pairs.

MSalters 2010-03-10 08:53:26

but that says nothing about the encodings used

jalf 2010-03-11 05:19:18

@jalf: True. That applies to the IDE setting, the compiler interpretation of strings and the A/W function choice. In all three cases the distinction is boolean, and the MBCS encoding unspecified.

MSalters 2010-03-11 10:04:53
