I'm reading a stream and am wondering why the UTF-8 encoded string is shorter than the ASCII one.

  ASCIIEncoding encoder = new ASCIIEncoding();
  UTF8Encoding enc = new UTF8Encoding();
  string response = encoder.GetString(message, 0, bytesRead);     // response.Length == 4096
  string responseUtf8 = enc.GetString(message, 0, bytesRead);     // responseUtf8.Length == 3955
A: 

Perhaps the message contained some characters that couldn't be encoded as a single byte in UTF-8; each such multi-byte sequence decodes to a single character, which makes the UTF-8 string shorter.
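
For instance, a minimal sketch (the characters are my own examples, not the question's data):

  using System;
  using System.Text;

  class ByteCountDemo
  {
      static void Main()
      {
          // 'é' (U+00E9) takes two bytes in UTF-8, so two bytes of the
          // stream can collapse into a single decoded character.
          Console.WriteLine(Encoding.UTF8.GetByteCount("é"));   // 2
          Console.WriteLine(Encoding.UTF8.GetByteCount("abc")); // 3 (plain ASCII: one byte each)
      }
  }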

Martin Törnwall
+4  A: 

UTF-8 handles strings differently than ASCII: in UTF-8, each character may be one to four bytes long, whereas ASCII treats every byte as one character. The C# UTF-8 decoder counts well-formed UTF-8 characters instead of bytes. I hope this helps you.
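
A small self-contained sketch of the difference (the string "café" is just my example, not the question's data):

  using System;
  using System.Text;

  class LengthDemo
  {
      static void Main()
      {
          // "café" is four characters but five UTF-8 bytes ('é' takes two).
          byte[] bytes = Encoding.UTF8.GetBytes("café");

          string ascii = new ASCIIEncoding().GetString(bytes); // "caf??"
          string utf8  = new UTF8Encoding().GetString(bytes);  // "café"

          Console.WriteLine(ascii.Length); // 5 - one character per byte
          Console.WriteLine(utf8.Length);  // 4 - the two-byte sequence becomes one char
      }
  }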

Neo Adonis
I think this is right. Note that `ASCIIEncoding` doesn't have error detection, but `UTF8Encoding` does.
Matthew Flaschen
Huh? Error detection? What?
Timwi
As noted in the docs, [`ASCIIEncoding`](http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.asciiencoding.aspx) does not have error detection. So it will happily "decode" bytes that make no sense as ASCII into question marks.
Matthew Flaschen
@Matthew: How is that any different from `UTF8Encoding`? It will happily “decode” byte sequences that make no sense as UTF-8 into `U+FFFD`...
Timwi
@Timwi, as I said before [`UTF8Encoding`](http://msdn.microsoft.com/en-us/library/302sbf78.aspx) has error detection, which means you can tell it to throw an exception.
Matthew Flaschen
@Matthew: I see, you’re talking about a boolean parameter on the constructor. That really wasn’t clear from “it has error detection”. Also, I don’t see how that’s relevant to this answer...
Timwi
@Timwi, yes, "error detection" is the term the documentation uses, not one I devised. If `ASCIIEncoding` had error detection, you could have it throw in situations like the question, where it's being fed invalid bytes. So I definitely find the difference between the two classes relevant.
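
For illustration, a minimal sketch of that constructor parameter (the invalid byte is my own example):

  using System;
  using System.Text;

  class ErrorDetectionDemo
  {
      static void Main()
      {
          byte[] invalid = { 0xC3 }; // a truncated UTF-8 sequence

          // The default UTF8Encoding silently substitutes U+FFFD...
          Console.WriteLine(new UTF8Encoding().GetString(invalid));

          // ...but with throwOnInvalidBytes = true it throws instead.
          try
          {
              new UTF8Encoding(false, true).GetString(invalid);
          }
          catch (DecoderFallbackException e)
          {
              Console.WriteLine("Invalid input detected: " + e.Message);
          }
      }
  }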
Matthew Flaschen
+4  A: 

Because when decoding bytes, ASCIIEncoding replaces every byte greater than 127 (0x7F) with a question mark (?), which is one character each, while UTF8Encoding decodes multi-byte UTF-8 sequences correctly into single characters (for example, the three bytes 232, 170, 158 become the single character 語).
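
A quick self-contained sketch using those same three bytes:

  using System;
  using System.Text;

  class DecodeDemo
  {
      static void Main()
      {
          byte[] bytes = { 232, 170, 158 }; // 0xE8 0xAA 0x9E

          string ascii = new ASCIIEncoding().GetString(bytes);
          string utf8  = new UTF8Encoding().GetString(bytes);

          Console.WriteLine(ascii);        // "???" - three characters
          Console.WriteLine(utf8);         // "語"  - one character (U+8A9E)
          Console.WriteLine(ascii.Length); // 3
          Console.WriteLine(utf8.Length);  // 1
      }
  }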

Timwi
+3  A: 

That's because the stream is actually UTF-8 encoded. If it were ASCII encoded, the two strings would be identical.

When read as ASCII, the bytes of any multi-byte sequence (values outside the 0-127 ASCII range) will each be read as a separate character, and they will look like garbage.

When read as UTF-8, the byte combinations will be decoded into the correct characters, each multi-byte combination ending up as a single character.

(Note: Strings are not encoded; it's the stream that is encoded. You decode the stream from ASCII or UTF-8 into a Unicode character string.)
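
A sketch of decoding at the stream level instead (the file name is just a placeholder):

  using System;
  using System.IO;
  using System.Text;

  class StreamDecodeDemo
  {
      static void Main()
      {
          // Let a StreamReader decode the stream as UTF-8; it also handles
          // multi-byte sequences that straddle buffer boundaries, which
          // calling GetString on each raw read buffer does not.
          using (var stream = File.OpenRead("response.bin")) // placeholder file
          using (var reader = new StreamReader(stream, Encoding.UTF8))
          {
              string response = reader.ReadToEnd();
              Console.WriteLine(response.Length);
          }
      }
  }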

Guffa