I am currently developing an app for a customer in China. Most customers in China set their OS encoding to GB2312, so I need to write a text file encoded in GB2312.

  1. I use std::ofstream to write the file.
  2. I compile my application in MBCS mode, not Unicode.
  3. I use the following code to convert a CString to a std::string, and write it to the file using the ofstream:

std::string Utils::ToString(const CString& cString) {
    /* Only valid in an MBCS build: in a Unicode build LPCTSTR is
       const wchar_t*, which cannot be converted to std::string. */
    return (LPCTSTR)cString;
}

To my surprise, it just works. I thought I would at least need to use wstring, so I did some investigation.

Here is the generated MBCS.txt:

[image: MBCS.txt]

  1. I print a single character, 脚 (its GB2312 value is 0xBDC5).
  2. When I use a CString to carry this character, its length is 2.
  3. When I use Utils::ToString to convert it to std::string, the returned string's length is 2.
  4. I write it to the file using std::ofstream.

My questions are:

  1. When I examine MBCS.txt in a hex editor, the value is displayed as BD (LSB) and C5 (MSB). But I am on a little-endian machine. Shouldn't the hex editor show me C5 (LSB) and BD (MSB)? I checked Wikipedia, and GB2312 does not seem to specify an endianness.
  2. It seems that using std::string + CString works fine for my case. In what cases will this approach not work, and when should I start using wstring?
A: 

About 1: Endianness is a problem you meet when you serialize a unit in terms of smaller units (e.g. serializing 16-bit units as octets). I'm far from being a specialist in CJK encodings, but it seems to me that GB2312 is a coded character set that can be used with several encoding schemes. The encoding schemes cited on the Wikipedia page as being used for GB2312 (ISO 2022, EUC-CN and HZ) are all defined in terms of octets. So there is no endianness issue when they are serialized as octets: the octets BD C5 are simply written in that order.
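To make the point concrete, here is a minimal sketch, not the asker's actual code. The helper names `RoundTripBytes` and `MemoryBytes` and the file name in the usage note are made up for illustration. It writes a two-octet string to a file and reads the raw bytes back, then contrasts that with the in-memory layout of a 16-bit integer, which is the only place where host endianness shows up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: write a byte string to a file unchanged,
// then read the file back as raw octets.
std::vector<unsigned char> RoundTripBytes(const std::string& path,
                                          const std::string& bytes) {
    {
        std::ofstream out(path, std::ios::binary);
        assert(out.is_open());
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    }  // the stream is flushed and closed here
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>((std::istreambuf_iterator<char>(in)),
                                      std::istreambuf_iterator<char>());
}

// Hypothetical helper: copy out the in-memory octets of a 16-bit integer.
// This order DOES depend on the host (low octet first on little-endian).
std::vector<unsigned char> MemoryBytes(std::uint16_t value) {
    std::vector<unsigned char> bytes(sizeof value);
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}
```

Calling `RoundTripBytes("mbcs_test.txt", std::string("\xBD\xC5", 2))` should give back BD then C5 on any machine, because a std::string is already a sequence of octets. `MemoryBytes(0xBDC5)`, by contrast, is expected to start with C5 on a little-endian host and with BD on a big-endian one.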

Contrast this with the Unicode encoding schemes: UTF-8 is defined in terms of octets and has no endianness issue when serialized as octets; UTF-16 is defined in terms of 16-bit units, so endianness must be specified when it is serialized as octets; UTF-32 is defined in terms of 32-bit units, so endianness must likewise be specified when it is serialized as octets.
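As a rough illustration of that contrast, the sketch below encodes the same character in UTF-8 and in UTF-16 with both byte orders. It assumes the character from the question, 脚, is code point U+811A. The function names are made up for illustration, and the encoders only handle the simple cases noted in the comments.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<unsigned char>;

// UTF-8 for a code point in U+0800..U+FFFF (excluding surrogates):
// always three octets, in one fixed order.
Bytes EncodeUtf8ThreeByte(std::uint32_t cp) {
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);
    return { static_cast<unsigned char>(0xE0 | (cp >> 12)),
             static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
             static_cast<unsigned char>(0x80 | (cp & 0x3F)) };
}

// One 16-bit UTF-16 code unit serialized as two octets.
// The caller has to choose the byte order explicitly.
Bytes EncodeUtf16Unit(std::uint16_t unit, bool bigEndian) {
    const unsigned char hi = static_cast<unsigned char>(unit >> 8);
    const unsigned char lo = static_cast<unsigned char>(unit & 0xFF);
    return bigEndian ? Bytes{hi, lo} : Bytes{lo, hi};
}
```

With these helpers, `EncodeUtf8ThreeByte(0x811A)` yields E8 84 9A whatever the host, while `EncodeUtf16Unit(0x811A, false)` yields 1A 81 (UTF-16LE) and `EncodeUtf16Unit(0x811A, true)` yields 81 1A (UTF-16BE).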

AProgrammer