In .NET, a string is a Unicode character string. My understanding is that the string itself carries no particular encoding information, i.e. it is encoding-neutral. Is that right? And can you use any encoding to encode a string into a stream of bytes and then decode those bytes back into a recognizable string, as long as the same encoding is used in both directions?

+3  A: 

Yes, with the caveat that many encodings can't represent all Unicode code points, which makes some round trips lossy.
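A minimal sketch of that caveat, written in Python rather than C# purely for illustration. The helper name `round_trips` is hypothetical, and `errors="replace"` is an assumption chosen to substitute "?" for unencodable characters instead of raising, which is similar in spirit to a replacement fallback:

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if encoding and then decoding with the same codec restores text."""
    # errors="replace" swaps unencodable characters for "?" instead of raising.
    data = text.encode(encoding, errors="replace")
    return data.decode(encoding, errors="replace") == text


sample = "中文"

# UTF-8 and UTF-16 can represent every code point, so they always round-trip.
# cp936 (GBK) covers these two characters; latin-1 and ascii do not.
for name in ("utf-8", "utf-16", "cp936", "latin-1", "ascii"):
    print(name, round_trips(sample, name))
```

The round trip only survives when the chosen encoding can represent every character in the string; with `ascii` the two characters come back as `??`.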

Marcelo Cantos
Hi, I used a Chinese encoding, "Encoding.GetEncoding(936)", with the StreamWriter class to write a string of Chinese characters to a .txt file. Why can Notepad display the characters properly even though I never told it which encoding I used? I know it can auto-detect a byte order mark, but I don't think it can auto-detect an exotic encoding?
Aperture
@Aperture: Notepad applies some heuristics to the start of the file to work out the encoding. It usually gets it right, but it is possible to fool it. [See Michael Kaplan's blog for more details.](http://blogs.msdn.com/b/michkap/archive/2007/04/22/2239345.aspx)
Richard
@Richard: Bravo! SO is full of knowledgeable people!
Aperture
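+1  A: 

"Unicode" in .NET means UTF-16 (16-bit code units, so 2 bytes for most characters and 4 bytes for characters outside the Basic Multilingual Plane). UTF-16 is itself an encoding of the full Unicode character set, which would need 32 bits per character (4 bytes, UCS-4) to hold every character at a fixed width. So you can serialize the bytes as is, and any system that supports UTF-16 will deserialize them properly, provided both sides agree on the byte order.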

Eugene Mayevski 'EldoS Corp
+4  A: 

In .NET, a string consists of UTF-16 code units. There is no such thing as a "Unicode string". It could be a UCS-2 or UCS-4 string, or use one of the transformation formats such as UTF-7, UTF-8 or UTF-16, but you cannot call it simply "Unicode". It is important to understand the difference between them.

I know that somebody on the .NET team named a property of the Encoding class "Unicode", but that was a mistake. The class also has a "Default" property, which is another misnomer. This leads to many defects: most people don't read the manuals and simply don't realize that "Unicode" means UTF-16 and "Default" means the default OS code page.
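To make the naming point concrete, here is a small Python sketch (not .NET code) that encodes the same string two ways. It assumes the codec name `utf-16-le` as the counterpart of what .NET calls `Encoding.Unicode`, which is little-endian UTF-16; the variable names are illustrative only:

```python
text = "A中"  # U+0041 and U+4E2D

# Counterpart of .NET's Encoding.Unicode: UTF-16, little-endian, no byte order mark.
utf16le = text.encode("utf-16-le")
# Counterpart of .NET's Encoding.UTF8.
utf8 = text.encode("utf-8")

print(utf16le.hex())  # 41002d4e
print(utf8.hex())     # 41e4b8ad

# Decoding with a codec other than the one used for encoding does not restore the text.
print(utf8.decode("utf-16-le") == text)  # False
```

Both byte sequences represent the same two Unicode characters; "Unicode" names the character set, while UTF-16 and UTF-8 are two different ways of writing it down as bytes.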

As for the second part of your question, the answer is unfortunately no. It would be "yes" but for one small problem: GB18030, the standard encoding in the People's Republic of China. It has assigned code points that simply don't exist in the Unicode standard (yet). Possibly a new version of the Unicode standard will resolve this issue.

One important point here (going back to UTF-16) is that raw bytes are not necessarily a safe unit for conversions. The problem is surrogate pairs: you have to be careful, because one character can be encoded as a pair of two 16-bit code units, meaning four bytes.
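The surrogate pair issue can be sketched in Python as follows. The helper `utf16_code_units` is a hypothetical name introduced for illustration; it simply splits the UTF-16 (little-endian) bytes of a string into 16-bit units:

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units that make up text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


bmp_char = "中"             # U+4E2D, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside it

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0x4e2d']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']

# Cutting the byte stream between the two halves of a pair leaves invalid UTF-16.
half = astral_char.encode("utf-16-le")[:2]
try:
    half.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("cannot decode half a surrogate pair:", exc.reason)
```

Any code that slices, counts, or converts UTF-16 data two bytes at a time has to keep both halves of such a pair together.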

If you don't need to support GB18030, you can use the method you mention freely. If you ever want to sell your software in China, you will need to support it, and of course you will have to be very careful (extensive testing will be needed).

Paweł Dyda
Hi, thanks for clarifying one of my long-standing questions (what's the difference between UnicodeEncoding and UTF8Encoding). So Unicode=UTF-16 and UTF-8=UTF-8?
Aperture
@Paweł Dyda: In the case of GB18030, do you mean it defines code points that have no counterparts in the 16-bit Unicode standard, or even in the 32-bit Unicode / ISO 10646 standard?
Ondrej Tucny
@Ondrej: On the one hand, I meant that GB18030 defines more code points than ISO 10646, so there is no way to convert these additional code points. But I have also heard that GB18030:2005 assigns some glyphs that have no equivalents in Unicode 5.1. Am I wrong here?
Paweł Dyda
@Aperture: The problem is, there is no "Unicode encoding". There is the Unicode character set. To actually map these characters to something meaningful for computers, we use an encoding scheme, meaning we assign them to code points. Stored at a fixed width, these take 32 bits or 4 bytes, and that's what UCS-4 is all about. At the same time, there are the so-called transformation formats (UTF). You don't want to use four bytes for every character all the time, so UTF-16 exists: it is simply a different notation for UCS-4 that uses two or four bytes to encode a UCS-4 character. It is a variable-length encoding...
Paweł Dyda
@self: BTW, the current version of the Unicode standard is 6.0. It adds 222 CJK characters, and I believe these are exactly the characters that were missing. Now we have to wait for programming tools that support this standard. Will JDK7 be the first?
Paweł Dyda