I'm reading a stream and am wondering why the UTF-8 encoded string is shorter than the ASCII one.

  ASCIIEncoding encoder = new ASCIIEncoding();
  UTF8Encoding enc = new UTF8Encoding();
  string response = encoder.GetString(message, 0, bytesRead);     // response.Length == 4096
  string responseUtf8 = enc.GetString(message, 0, bytesRead);     // responseUtf8.Length == 3955
A: 

Perhaps the message contained some characters that couldn't be encoded as a single byte in UTF-8; each such multi-byte sequence decodes to a single character, which makes the UTF-8 string shorter.
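
For instance, a minimal sketch (the characters are my own examples, not the question's data):

  using System;
  using System.Text;

  class ByteCountDemo
  {
      static void Main()
      {
          // 'é' (U+00E9) takes two bytes in UTF-8, so two bytes of the
          // stream can collapse into a single decoded character.
          Console.WriteLine(Encoding.UTF8.GetByteCount("é"));   // 2
          Console.WriteLine(Encoding.UTF8.GetByteCount("abc")); // 3 (plain ASCII: one byte each)
      }
  }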

Martin Törnwall
+4  A: 

UTF-8 handles strings differently than ASCII: in UTF-8, each character may be one to four bytes long, whereas ASCII treats every byte as one character. The C# UTF-8 decoder counts well-formed UTF-8 characters instead of bytes. I hope this helps you.
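
A small self-contained sketch of the difference (the string "café" is just my example, not the question's data):

  using System;
  using System.Text;

  class LengthDemo
  {
      static void Main()
      {
          // "café" is four characters but five UTF-8 bytes ('é' takes two).
          byte[] bytes = Encoding.UTF8.GetBytes("café");

          string ascii = new ASCIIEncoding().GetString(bytes); // "caf??"
          string utf8  = new UTF8Encoding().GetString(bytes);  // "café"

          Console.WriteLine(ascii.Length); // 5 - one character per byte
          Console.WriteLine(utf8.Length);  // 4 - the two-byte sequence becomes one char
      }
  }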

Neo Adonis
I think this is right. Note that `ASCIIEncoding` doesn't have error detection, but `UTF8Encoding` does.
Matthew Flaschen
Huh? Error detection? What?
Timwi
As noted in the docs, [`ASCIIEncoding`](http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.asciiencoding.aspx) does not have error detection. So it will happily "decode" bytes that make no sense as ASCII into question marks.
Matthew Flaschen
@Matthew: How is that any different from `UTF8Encoding`? It will happily “decode” byte sequences that make no sense as UTF-8 into `U+FFFD`...
Timwi
@Timwi, as I said before [`UTF8Encoding`](http://msdn.microsoft.com/en-us/library/302sbf78.aspx) has error detection, which means you can tell it to throw an exception.
Matthew Flaschen
@Matthew: I see, you’re talking about a boolean parameter on the constructor. That really wasn’t clear from “it has error detection”. Also, I don’t see how that’s relevant to this answer...
Timwi
@Timwi, yes, "error detection" is the term the documentation uses, not one I devised. If `ASCIIEncoding` had error detection, you could have it throw in situations like the question, where it's being fed invalid bytes. So I definitely find the difference between the two classes relevant.
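
For illustration, a minimal sketch of that constructor parameter (the invalid byte is my own example):

  using System;
  using System.Text;

  class ErrorDetectionDemo
  {
      static void Main()
      {
          byte[] invalid = { 0xC3 }; // a truncated UTF-8 sequence

          // The default UTF8Encoding silently substitutes U+FFFD...
          Console.WriteLine(new UTF8Encoding().GetString(invalid));

          // ...but with throwOnInvalidBytes = true it throws instead.
          try
          {
              new UTF8Encoding(false, true).GetString(invalid);
          }
          catch (DecoderFallbackException e)
          {
              Console.WriteLine("Invalid input detected: " + e.Message);
          }
      }
  }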
Matthew Flaschen
+4  A: 

Because when decoding bytes, ASCIIEncoding replaces every byte greater than 127 (0x7F) with a question mark (?), which is one character each, while UTF8Encoding decodes multi-byte UTF-8 sequences correctly into single characters (for example, the three bytes 232, 170, 158 become the single character 語).
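
A quick self-contained sketch using those same three bytes:

  using System;
  using System.Text;

  class DecodeDemo
  {
      static void Main()
      {
          byte[] bytes = { 232, 170, 158 }; // 0xE8 0xAA 0x9E

          string ascii = new ASCIIEncoding().GetString(bytes);
          string utf8  = new UTF8Encoding().GetString(bytes);

          Console.WriteLine(ascii);        // "???" - three characters
          Console.WriteLine(utf8);         // "語"  - one character (U+8A9E)
          Console.WriteLine(ascii.Length); // 3
          Console.WriteLine(utf8.Length);  // 1
      }
  }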

Timwi
+3  A: 

That's because the stream is actually UTF-8 encoded. If it were ASCII encoded, the two strings would be identical.

When read as ASCII, the bytes of any multi-byte sequence (values outside the 0-127 ASCII range) will each be read as a separate character, and they will look like garbage.

When read as UTF-8, the byte combinations will be decoded into the correct characters, each multi-byte combination ending up as a single character.

(Note: Strings are not encoded; it's the stream that is encoded. You decode the stream from ASCII or UTF-8 into a Unicode character string.)
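
A sketch of decoding at the stream level instead (the file name is just a placeholder):

  using System;
  using System.IO;
  using System.Text;

  class StreamDecodeDemo
  {
      static void Main()
      {
          // Let a StreamReader decode the stream as UTF-8; it also handles
          // multi-byte sequences that straddle buffer boundaries, which
          // calling GetString on each raw read buffer does not.
          using (var stream = File.OpenRead("response.bin")) // placeholder file
          using (var reader = new StreamReader(stream, Encoding.UTF8))
          {
              string response = reader.ReadToEnd();
              Console.WriteLine(response.Length);
          }
      }
  }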

Guffa