I've tracked a problem I'm having down to the following inexplicable behaviour within the .NET System.Text.Encoding class:

byte[] original = new byte[] { 128 };
string encoded = System.Text.Encoding.UTF8.GetString(original);
byte[] decoded = System.Text.Encoding.UTF8.GetBytes(encoded);
Console.WriteLine(original[0] == decoded[0]);

Am I expecting too much in thinking that decoded should equal original here?

UTF8, UTF7, UTF32, Unicode and ASCII all produce various varieties of wrongness. What's going on?

+1  A: 

This is because when you convert to a string, it will contain the UTF-8 BOM, which is three bytes at the beginning.

Darin Dimitrov
I noted that if you instead use 127 for the byte value, the decoded byte array contains exactly one byte, having the value 127. What happens at 128?
Fredrik Mörk
At 128, you leave ASCII world and enter characters that change based on encoding.
jvenema
The UTF-8 BOM is EF BB BF. This is not the case here. It is the replacement character. See my answer.
Mark Byers
@jvenema: I am aware of that. Strangely enough, using the byte value 239 produces the same result. All other bytes in the range 128-255 output `False` with the OP's code sample.
Fredrik Mörk
+4  A: 

This is an invalid UTF-8 byte sequence.

You need

byte[] original = new byte[] { 0xc2, 128 };

Nothing to do with byte order marks.

Update

Or preferably you should do

char[] c = { (char)128 };
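
As a rough sketch of why this works (assuming a modern .NET with top-level statements; the variable names are just for illustration): 0xC2 0x80 is the well-formed two-byte UTF-8 encoding of U+0080, so it survives the trip through a string, and starting from the char gives the same two bytes.

```csharp
using System;
using System.Linq;
using System.Text;

// 0xC2 0x80 is the two-byte UTF-8 encoding of U+0080.
byte[] original = { 0xC2, 0x80 };
string text = Encoding.UTF8.GetString(original);
byte[] roundTripped = Encoding.UTF8.GetBytes(text);

Console.WriteLine(text.Length);                            // 1
Console.WriteLine((int)text[0]);                           // 128
Console.WriteLine(original.SequenceEqual(roundTripped));   // True

// Going the other way: start from the character rather than raw bytes.
char[] c = { (char)128 };
byte[] fromChars = Encoding.UTF8.GetBytes(c);
Console.WriteLine(BitConverter.ToString(fromChars));       // C2-80
```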
leppie
+4  A: 

The original data is an invalid UTF-8 sequence.

decoded = { 0xef, 0xbf, 0xbd }

Searching for this string turned up this: http://en.wikipedia.org/wiki/Unicode%5FSpecials. It is the UTF-8 encoding of U+FFFD, the replacement character, which the decoder substitutes for invalid input.
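
To see the substitution happen, something like the following should do (a sketch assuming a modern .NET with top-level statements; the `strict` variable is my own addition, using the `UTF8Encoding` constructor that throws on invalid bytes instead of replacing them):

```csharp
using System;
using System.Text;

byte[] original = { 128 };  // a lone 0x80 is a continuation byte, not a valid sequence

string text = Encoding.UTF8.GetString(original);
Console.WriteLine((int)text[0] == 0xFFFD);           // True: U+FFFD replaced the bad byte

byte[] decoded = Encoding.UTF8.GetBytes(text);
Console.WriteLine(BitConverter.ToString(decoded));   // EF-BF-BD

// A strict encoding throws instead of silently substituting.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool threw = false;
try
{
    strict.GetString(original);
}
catch (DecoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw);                            // True
```

This would also explain the 239 oddity from the comments above: a lone 0xEF is replaced as well, and the first byte of the replacement's encoding happens to be 0xEF (239), so comparing only the first byte prints True.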

Mark Byers
+1  A: 

In general you can't roundtrip arbitrary bytes in this way, and you are wrong to expect to be able to do so for an arbitrary encoding, and in particular for any of the UTF encodings.

However there is an encoding that will allow you to roundtrip for all byte values - Latin1 aka ISO-8859-1 aka CP28591. This encoding is similar but not identical to the default Windows ANSI encoding and is useful for scenarios where roundtripping in this way is important - e.g. writing a stream that mixes text and control characters to a serial port.
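
A quick way to check this for yourself might look like the following (a sketch assuming a modern .NET with top-level statements; it asks for the encoding by code page number 28591, and the variable names are illustrative):

```csharp
using System;
using System.Linq;
using System.Text;

Encoding latin1 = Encoding.GetEncoding(28591);  // ISO-8859-1

// Every possible byte value, 0 through 255.
byte[] original = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

string text = latin1.GetString(original);
byte[] roundTripped = latin1.GetBytes(text);

Console.WriteLine(text.Length);                            // 256
Console.WriteLine(original.SequenceEqual(roundTripped));   // True
```

Each byte maps to the char with the same numeric value, which is why nothing is lost in either direction.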

See this answer, or other questions that mention Latin1.

Joe