Is a (unicode) String encoding neutral?

In .NET, a string consists of UTF-16 code units. There is no such thing as a "Unicode string". It could be a UCS-2 or UCS-4 string, or use one of the various transformation formats such as UTF-7, UTF-8, or UTF-16, but you cannot simply call it "Unicode". It is important to understand the difference between them.

I know that somebody on the .NET team named a property of the Encoding class "Unicode", but that was a mistake. The class also contains a "Default" property, which is another misnamed property. This leads to many defects: most people don't read the manuals and simply don't realize that "Unicode" is UTF-16 and "Default" means the default OS code page.
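To see why an unnamed "default" code page causes defects, here is a minimal sketch in Python rather than .NET. The codec names `cp1252`, `cp1251` and `utf-16-le` are Python's, and the two code pages are only stand-ins for the defaults of two differently configured machines. The same byte decodes to a different character depending on the reader's code page, while an explicitly named UTF-16 encoding is unambiguous:

```python
# Illustration only: two Windows code pages stand in for the "default"
# code page of two differently configured machines.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

written = text.encode("cp1252")        # writer's default: Western European
read_back = written.decode("cp1251")   # reader's default: Cyrillic

print(written.hex())                   # e9
print(ascii(read_back))                # '\u0439' (CYRILLIC SMALL LETTER SHORT I)

# Naming the encoding explicitly removes the guesswork. Little-endian UTF-16
# is what .NET's Encoding.Unicode produces.
utf16 = text.encode("utf-16-le")
print(utf16.hex())                     # e900
assert utf16.decode("utf-16-le") == text
```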

As for the second part of your question, the answer is unfortunately no. It would be "yes", but there is one small problem: GB18030, the standard encoding in the People's Republic of China. It assigns code points which simply don't exist in the Unicode standard (yet). Possibly a new version of the Unicode standard will resolve this issue.

One important point here (going back to UTF-16) is that raw bytes are not necessarily safe to work with in conversions. The problem is related to surrogate pairs: you have to be careful, because one character can be encoded as a pair of 16-bit code units, meaning four bytes.
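The surrogate-pair issue can be sketched in Python (an illustration, not .NET code; `decodes_cleanly` is a hypothetical helper). A character outside the Basic Multilingual Plane takes two UTF-16 code units, and cutting the byte stream between them leaves data that no longer decodes:

```python
# U+20000 is a CJK ideograph outside the Basic Multilingual Plane.
ch = "\U00020000"

utf16 = ch.encode("utf-16-le")
code_units = len(utf16) // 2

print(utf16.hex())   # 40d800dc: high surrogate D840, low surrogate DC00
print(code_units)    # 2 code units (4 bytes) for a single character


def decodes_cleanly(data: bytes) -> bool:
    """Return True if data is complete, valid little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Cutting the byte stream between the two surrogates corrupts the character.
print(decodes_cleanly(utf16))      # True
print(decodes_cleanly(utf16[:2]))  # False
```

Any code that slices, truncates, or buffers UTF-16 bytes at fixed offsets needs to keep such pairs together.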

If you don't need to support GB18030, you can use the method you mention freely. If you ever want to sell your software in China, you will need to support it, and of course you will have to be very careful (extensive testing will be needed).
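Part of that testing can be a simple round-trip check. The following is a minimal sketch in Python, assuming its built-in `cp936` and `gb18030` codecs; `round_trips` and the sample strings are hypothetical. It only exercises the direction from Unicode text to bytes and back, so it cannot probe bytes that have no Unicode equivalent in the first place:

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives encode/decode in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


samples = {
    "common Chinese": "\u4e2d\u6587",   # two everyday ideographs
    "CJK Extension B": "\U00020000",    # outside the Basic Multilingual Plane
}

for label, sample in samples.items():
    for encoding in ("cp936", "gb18030", "utf-16-le"):
        print(label, encoding, round_trips(sample, encoding))
```

In this sketch code page 936 handles the common characters but rejects the Extension B one, which is the kind of gap such a test is meant to surface.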

Hi, I used a Chinese encoding, "Encoding.GetEncoding(936)", and the StreamWriter class to write a string of Chinese characters into a .txt file. Why can Notepad display the characters properly even though I did not tell Notepad which encoding I used? I know there is automatic byte order mark detection, but I don't think it can auto-detect an exotic encoding scheme?

Aperture 2010-10-17 09:17:55

@Aperture: Notepad applies some heuristics to the start of the file to work out the encoding. It usually gets it right, but it is possible to fool it. [See Michael Kaplan's blog for more details.](http://blogs.msdn.com/b/michkap/archive/2007/04/22/2239345.aspx)

Richard 2010-10-17 09:54:32
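As a rough illustration of why a byte order mark alone cannot explain this, here is a minimal Python sketch. It is not Notepad's actual algorithm; `sniff_bom` is a hypothetical helper. Text written with code page 936 carries no BOM at all, so any detection has to rely on guessing from the bytes themselves:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


text = "\u4e2d\u6587"  # two Chinese characters
print(sniff_bom(text.encode("utf-8-sig")))  # utf-8
print(sniff_bom(text.encode("cp936")))      # None
```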

@Richard: Bravo! SO is full of knowledgeable people!

Aperture 2010-10-17 11:28:31

Hi, thanks for clarifying one of my long-standing questions (what's the difference between UnicodeEncoding and UTF8Encoding). So Unicode=UTF-16 and UTF-8=UTF-8?

Aperture 2010-10-17 09:22:52

@Paweł Dyda: In the case of GB18030, do you mean it defines code points that don't have any counterparts in the 16-bit Unicode standard, or even in the 32-bit Unicode / ISO 10646 standard?

Ondrej Tucny 2010-10-17 09:23:49

@Ondrej: On the one hand, I meant that GB18030 defines more code points than ISO 10646, so there is no way to convert these additional code points. But I also heard that GB18030:2005 assigns some glyphs that have no equivalents in Unicode 5.1. Am I wrong here?

Paweł Dyda 2010-10-17 09:32:45

@Aperture: The problem is, there is no "Unicode encoding". There is the Unicode character set. To actually map these characters to something meaningful for computers, we use an encoding scheme, meaning we assign them to code points. These are 32 bits, or 4 bytes, wide, and that's what UCS-4 is all about. At the same time, there are the so-called Transformation Formats (UTF). You don't want to use four bytes for every character all the time, which is why UTF-16 exists: it is simply a different notation for UCS-4 that uses either two or four bytes to encode a UCS-4 character. It is a variable-length encoding...

Paweł Dyda 2010-10-17 09:43:31
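The distinction between a code point and its transformation formats can be sketched in Python (an illustration only; `encoded_forms` is a hypothetical helper, and Python's `utf-32-be` codec stands in for UCS-4). Each character has one code point but a different byte sequence, of a different length, in each format:

```python
def encoded_forms(ch: str) -> dict:
    """Map each encoding name to the hex bytes it produces for ch."""
    return {enc: ch.encode(enc).hex() for enc in ("utf-8", "utf-16-be", "utf-32-be")}


euro = encoded_forms("\u20ac")       # U+20AC EURO SIGN, inside the BMP
ext_b = encoded_forms("\U00020000")  # U+20000, outside the BMP

print(euro)   # {'utf-8': 'e282ac', 'utf-16-be': '20ac', 'utf-32-be': '000020ac'}
print(ext_b)  # {'utf-8': 'f0a08080', 'utf-16-be': 'd840dc00', 'utf-32-be': '00020000'}
```

The 32-bit form is always four bytes, while UTF-16 needs two bytes for the first character and four for the second.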

@self: BTW, the current version of the Unicode standard is 6.0. It adds 222 CJK characters, and I believe these are exactly the characters that were missing. Now we have to wait for programming tools that support this standard. Will JDK 7 be the first?

Paweł Dyda 2010-10-18 19:21:04
