I'm just looking at the constructors for StreamReader / StreamWriter and I note they use UTF-8 as the default. Anyone know why this is? I would have presumed it would have been a safer bet to default to Unicode.

+5  A: 

UTF-8 will work with any ASCII document, and is typically more compact than UTF-16 - but it still covers the whole of Unicode. I'd say that UTF-8 is far more common than UTF-16. It's also the default for XML (when there's no BOM and no explicit encoding specified).

Why do you think it would be better to default to UTF-16? (That's what Encoding.Unicode is.)

EDIT: I suspect you're confused about exactly what UTF-8 can handle. This page describes it pretty clearly, including how any particular Unicode character is encoded. It's a variable-width encoding, but it covers the whole of Unicode.
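
For illustration, a minimal C# sketch of both points (the sample strings are just made up for the demo):

    using System;
    using System.Text;

    string ascii = "Hello, world!";   // plain ASCII text
    string emoji = "\U0001F600";      // U+1F600, outside the BMP

    // UTF-8 stores ASCII text in one byte per character; UTF-16 needs two.
    Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 13
    Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 26

    // Both cover all of Unicode, including characters beyond U+FFFF.
    Console.WriteLine(Encoding.UTF8.GetByteCount(emoji));    // 4
    Console.WriteLine(Encoding.Unicode.GetByteCount(emoji)); // 4 (surrogate pair)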

Jon Skeet
I would assume (correct me if I'm wrong ;)) that since .NET natively uses UTF-16 for strings, there are going to be scenarios (perhaps in different cultures) where it attempts to output a character that UTF-8 cannot handle.
Quibblesome
UTF-8 handles all Unicode characters.
Jon Skeet
@Quarrelsome UTF-8 is a variable-length character encoding, and it's able to represent _any_ character from the Unicode standard. It'll just use more octets (8-bit bytes), up to 4 of them.
Anton Gogolev
Hoorah, I am learning! TY.
Quibblesome
+5  A: 

UTF-8 is Unicode; more specifically, it's one of the Unicode encodings.

More importantly, it's backwards compatible with ASCII, plus it's the standard default for XML and HTML.
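
A quick sketch of that backwards compatibility, using nothing beyond the standard Encoding API:

    using System;
    using System.Linq;
    using System.Text;

    string text = "plain ASCII";

    // For pure ASCII input, the UTF-8 bytes are identical to the ASCII bytes,
    // which is why UTF-8 readers handle legacy ASCII files as-is.
    Console.WriteLine(Encoding.ASCII.GetBytes(text)
        .SequenceEqual(Encoding.UTF8.GetBytes(text))); // True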

blowdart
+1  A: 

"Unicode" is the name of a standard, so there's no such encoding as "Unicode". Rather, there are two mapping methods: UTF and UCS.

As for "why" part, UTF-8 has maximum compatibility with ASCII.

Anton Gogolev
Well, in the .NET framework the UTF-16 encoding is called Unicode. (The Encoding.Unicode property.) That doesn't help with the confusion. ;)
Guffa
+2  A: 

As all the others have already said, UTF-8 is an encoding standard within Unicode. UTF-8 uses a variable number of bytes to encode every Unicode character there is.

All ASCII characters are represented as-is, so ASCII files can be read with no further ado. As soon as a byte in the stream has its highest bit set (value > 127), it starts a multi-byte sequence: the number of leading 1-bits in that lead byte tells the reader how many bytes (two to four) belong together, and the following continuation bytes all have the form 10xxxxxx. The whole sequence is then decoded as one character.
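
As a small illustration of those bit patterns (the euro sign is just an arbitrary non-ASCII example):

    using System;
    using System.Text;

    // '€' (U+20AC) becomes a three-byte UTF-8 sequence: a lead byte
    // 0xE2 (1110xxxx) followed by continuation bytes 0x82 and 0xAC (10xxxxxx).
    foreach (byte b in Encoding.UTF8.GetBytes("\u20AC"))
        Console.WriteLine($"0x{b:X2}"); // 0xE2, 0x82, 0xAC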

Non-ASCII characters from LATIN-1 (ISO 8859-1, often called ANSI) take two bytes in UTF-8: é, for example, is encoded as 0xC3 0xA9. Unicode also allows é to be written in decomposed form, as e followed by a combining acute accent (´), in which case Length('é') is 2.
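
A short sketch of the two representations:

    using System;

    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // e followed by a combining acute accent

    Console.WriteLine(precomposed.Length);                    // 1
    Console.WriteLine(decomposed.Length);                     // 2
    Console.WriteLine(decomposed.Normalize() == precomposed); // True (NFC)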

Windows uses UTF-16 internally. A single 16-bit code unit covers only the 64K characters of the Basic Multilingual Plane, which is by no means all of Unicode; characters beyond that are encoded as surrogate pairs of two code units, so UTF-16 does reach the full range. UTF-32 encodes every character in a single 32-bit unit (the code space itself is capped at U+10FFFF). Neither is backward compatible with ASCII, as both contain zero bytes:

A = ASCII 0x41 = UTF-8 0x41 = UTF-16 0x0041 = UTF-32 0x00000041

There are also little and big endian encodings:

A = UTF-16 big endian 0x0041 = UTF-16 little endian 0x4100
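
In .NET terms (Encoding.Unicode is the little-endian variant):

    using System;
    using System.Text;

    foreach (byte b in Encoding.BigEndianUnicode.GetBytes("A"))
        Console.Write($"0x{b:X2} "); // 0x00 0x41
    Console.WriteLine();
    foreach (byte b in Encoding.Unicode.GetBytes("A"))
        Console.Write($"0x{b:X2} "); // 0x41 0x00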

Imagine using UTF-16 or UTF-32 to save your files: text files would double or quadruple in size compared to ASCII (and compared to UTF-8, when only ASCII characters are used). UTF-8 not only covers every character in the Unicode standard, including future additions, but also stores text space-efficiently.
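
A rough sketch of the size difference for a purely ASCII payload (the 1000-character string is arbitrary):

    using System;
    using System.Text;

    string text = new string('a', 1000); // 1000 ASCII characters

    Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 1000
    Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 2000
    Console.WriteLine(Encoding.UTF32.GetByteCount(text));   // 4000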

Usually the first few bytes of a file, the BOM (Byte Order Mark), tell you which encoding is used: two bytes for UTF-16, three for UTF-8, four for UTF-32. If it is omitted, XML and StreamReader default to UTF-8, as you found out. This again makes sense, as ASCII files do not have a BOM and are therefore read correctly in most cases. The same is not guaranteed for files that use all of LATIN-1.
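
A minimal round-trip sketch of that BOM handling (the temp file is just for the demo):

    using System;
    using System.IO;
    using System.Text;

    string path = Path.GetTempFileName();

    // new UTF8Encoding(true) emits the 3-byte UTF-8 BOM (EF BB BF).
    File.WriteAllText(path, "héllo", new UTF8Encoding(true));

    // StreamReader sniffs the BOM by default and falls back to UTF-8
    // when none is present.
    using (var reader = new StreamReader(path, Encoding.UTF8,
                                         detectEncodingFromByteOrderMarks: true))
    {
        Console.WriteLine(reader.ReadToEnd()); // héllo
    }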

Ralph Rickenbach