I'm just looking at the constructors for StreamReader / StreamWriter and I note they use UTF-8 as the default. Anyone know why this is? I would have presumed it would have been a safer bet to default to Unicode.
UTF-8 will work with any ASCII document, and is typically more compact than UTF-16 - but it still covers the whole of Unicode. I'd say that UTF-8 is far more common than UTF-16. It's also the default for XML (when there's no BOM and no explicit encoding specified).
Why do you think it would be better to default to UTF-16? (That's what Encoding.Unicode is.)
EDIT: I suspect you're confused about exactly what UTF-8 can handle. This page describes it pretty clearly, including how any particular Unicode character is encoded. It's a variable-width encoding, but it covers the whole of Unicode.
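To make the "variable-width" point concrete, here is a small Python sketch (not .NET code; the sample characters and the helper name encoded_sizes are just my own choices for illustration). It prints how many bytes a few characters take in UTF-8 and in UTF-16.

```python
# Sample characters, written as escapes so the source file itself stays ASCII:
# "A", e-acute, the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Both encodings can represent all four characters; they only differ in how many bytes each one costs.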
UTF-8 is Unicode, or more specifically one of the Unicode encodings.
More importantly, it's backwards compatible with ASCII, and it's the standard default for XML and HTML.
"Unicode" is the name of a standard, so there's no such encoding as "Unicode". Rather, there are two mapping methods: UTF and UCS.
As for the "why" part, UTF-8 has maximum compatibility with ASCII.
As all the others already said, UTF-8 is an encoding standard within Unicode. UTF-8 uses a variable number of bytes to encode every Unicode character.
All ASCII characters are represented as is, so that ASCII files can be read with no further ado. As soon as a byte in the stream has its 8th bit (the highest bit, value > 127) set, it is part of a multi-byte sequence: the leading bits of the first byte tell the reader how many bytes (two to four) belong together, and each following byte of the sequence also has its highest bit set (it has the form 10xxxxxx). The combination is then regarded as 1 character.
Characters in LATIN-1 (often loosely called ANSI) that lie outside ASCII are encoded using two bytes: for example é is encoded as hC3 hA9. The byte length of 'é' is therefore 2, even though it is still a single character.
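A rough Python sketch of how a reader can tell from the first byte how long a UTF-8 sequence is. The function names are hypothetical, and this does no validation of the continuation bytes, so treat it as an illustration rather than a real decoder.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte occupies."""
    if lead_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"not a valid lead byte: {lead_byte:#04x}")

def split_characters(data):
    """Split UTF-8 bytes into one bytes object per character (no validation)."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

encoded = "caf\u00e9".encode("utf-8")   # "café"
print(split_characters(encoded))        # the last chunk holds both bytes of é
```

In practice you would simply call decode("utf-8") and let the library do this, including the error checking that is skipped here.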
Windows uses UTF-16 internally. A single 16-bit unit can only represent 64K characters, which is by no means all Unicode characters; the remaining ones are encoded as a pair of 16-bit units (a surrogate pair). UTF-32 has room for every character in a single unit, although the Unicode code space itself is capped at h10FFFF. Neither is byte-compatible with ASCII, as they pad ASCII characters with zero bytes:
A = ASCII h41 = UTF-8 h41 = UTF-16 h0041 = UTF-32 h00000041
There are also little and big endian encodings:
A = UTF-16 big endian h0041 = UTF-16 little endian h4100
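The byte values above are easy to check yourself. A quick Python sketch (the helper hex_bytes is just a name I made up) that prints the letter A in each encoding, using the explicit -be / -le codec names so that no BOM is written:

```python
encodings = ["ascii", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

def hex_bytes(text, encoding):
    """Encode text and return the resulting bytes as a hex string."""
    return text.encode(encoding).hex()

for name in encodings:
    print(f"{name:10} {hex_bytes('A', name)}")
```

The big-endian variants put the zero bytes first, the little-endian variants put them last.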
Imagine using UTF-16 or UTF-32 to save your files. For text files they would double or quadruple in size compared to ASCII, and compared to UTF-8 as long as only ASCII characters are used. UTF-8 not only allows for all characters in the Unicode standard, including future additions, but also stores them space-efficiently.
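A small Python sketch of the size difference for ASCII-only text. The sample text and the helper name encoded_size are arbitrary choices of mine; real files will of course vary with their content.

```python
def encoded_size(text, encoding):
    """Number of bytes text occupies in the given encoding (no BOM)."""
    return len(text.encode(encoding))

ascii_text = "Hello, world!" * 100   # 1300 ASCII characters

sizes = {
    enc: encoded_size(ascii_text, enc)
    for enc in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
for enc, size in sizes.items():
    print(f"{enc:10} {size} bytes")
```

For text that is mostly non-ASCII (for example CJK) the picture changes, since those characters take three bytes in UTF-8 and two in UTF-16.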
Usually the first few bytes of a file, the BOM or Byte Order Mark, tell you which encoding is used (two bytes for UTF-16, three for UTF-8, four for UTF-32). If it is omitted, XML and StreamReader use UTF-8, as you found out. This again makes sense, as ASCII files do not have a BOM and therefore in most cases are read correctly. This might not be true for files using all of LATIN-1, because LATIN-1 bytes above 127 are generally not valid UTF-8.
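Here is a hedged Python sketch of BOM sniffing with a UTF-8 fallback. It is not how StreamReader is actually implemented; sniff_encoding and decode_with_bom are hypothetical helper names, and only the BOM constants come from Python's standard codecs module.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(data, default="utf-8"):
    """Return (encoding_name, bom_length) guessed from the leading bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM found: fall back to the default, as the answer describes.
    return default, 0

def decode_with_bom(data):
    """Decode bytes, skipping a BOM if present and assuming UTF-8 otherwise."""
    name, skip = sniff_encoding(data)
    return data[skip:].decode(name)

print(sniff_encoding(b"plain ascii"))
print(decode_with_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```

With this fallback, a BOM-less LATIN-1 file such as b"caf\xe9" fails to decode (Python raises UnicodeDecodeError), which is the problem case mentioned above.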