How does the Chinese GB18030 code set differ from Unicode?
What special techniques are required for handling GB18030?
Are there any (open source) libraries for handling GB18030?
As per the Wikipedia article on GB18030, "GB18030 can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set." That is, every Unicode character can be encoded in GB18030, but the resulting byte sequences differ from the ones UTF-8 or UTF-16 would produce. Handling the GB18030 encoding doesn't require any more special techniques than are required for any other non-Unicode encoding.
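To make the "same characters, different bytes" point concrete, here is a minimal sketch using Python's built-in `gb18030` codec. The sample characters and variable names are my own choices for illustration, not something from the article.

```python
# Compare how the same characters are encoded in GB18030, UTF-8 and UTF-16.
# The sample characters are arbitrary illustrations.
samples = ["A", "ß", "中", "😀"]
codecs_to_compare = ["gb18030", "utf-8", "utf-16-be"]

for ch in samples:
    for codec in codecs_to_compare:
        encoded = ch.encode(codec)
        # Each encoding can represent every sample, so decoding round-trips.
        assert encoded.decode(codec) == ch
        print(f"U+{ord(ch):04X} {codec:10} {encoded.hex(' ')}")

# Kept in variables so the difference is easy to inspect.
gb_eszett = "ß".encode("gb18030")    # four bytes: 81 30 89 38
utf8_eszett = "ß".encode("utf-8")    # two bytes: c3 9f
```

ASCII characters come out as the same single byte in GB18030 and UTF-8; everything else differs.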
The ICU project is an open source library (for C/C++ and Java) that has full support for many different encodings, including GB18030. The ICU documentation describes how to convert between different encodings.
What special techniques are required for handling GB18030?
The biggest thing to be aware of is that, unlike UTF-8, GB18030 allows ASCII bytes to occur within the encoding of a multi-byte character. (For example, 'ß' is encoded as the bytes 81 30 89 38, which contains the ASCII encodings of '0' and '8'.) This means that you can't use a simple byte-oriented find/index function to search for ASCII characters, because a match may land in the middle of a multi-byte character.
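A small sketch of the pitfall and one possible workaround, in Python. The helper `find_byte_offset` is a hypothetical name of mine, and decoding the whole buffer first is just the simplest approach, not the only one.

```python
# "ßx8" in GB18030 is the bytes 81 30 89 38 | 78 | 38.
data = "ßx8".encode("gb18030")

# Naive byte search: hits the 0x38 inside the encoding of 'ß'.
naive_offset = data.find(b"8")


def find_byte_offset(data: bytes, needle: str, encoding: str = "gb18030") -> int:
    """Return the byte offset of the first real occurrence of needle, or -1.

    Hypothetical helper: decodes first so that only whole characters match,
    then re-encodes the prefix to translate the character index into bytes.
    """
    text = data.decode(encoding)
    index = text.find(needle)
    if index < 0:
        return -1
    return len(text[:index].encode(encoding))


safe_offset = find_byte_offset(data, "8")
print(naive_offset, safe_offset)

# The same problem affects membership tests: 'ß' contains no '0' character,
# yet its encoded form contains the byte 0x30.
false_positive = b"0" in "ß".encode("gb18030")
```

Decoding costs a pass over the data, but after that all searching happens on characters rather than bytes, so false matches like the one above can't occur.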