The snippet says it all :-)

UTF8Encoding enc = new UTF8Encoding(true/*include Byte Order Mark*/);
byte[] data = enc.GetBytes("a");
// data has length 1.
// I expected the BOM to be included. What's up?
+5  A: 

You wouldn't want the BOM to be emitted on every call to GetBytes; otherwise you'd have no way of (say) writing a file a line at a time.

By exposing it with GetPreamble, callers can insert the preamble just at the appropriate point (i.e. at the start of their data). I agree that the documentation could be a lot clearer though.
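
As a rough sketch of the "line at a time" case, assuming the output goes to an in-memory stream (the class and method names here are made up for illustration), the preamble is written once and GetBytes is then called per line:

```csharp
using System.IO;
using System.Text;

public static class PreambleOnceDemo
{
    // Hypothetical helper: encodes each line separately but emits the BOM only once.
    public static byte[] EncodeLines(string[] lines)
    {
        var enc = new UTF8Encoding(true);
        using (var stream = new MemoryStream())
        {
            // The BOM goes at the very start of the data, and nowhere else.
            byte[] preamble = enc.GetPreamble();
            stream.Write(preamble, 0, preamble.Length);

            foreach (string line in lines)
            {
                // GetBytes never adds a BOM, so it is safe to call repeatedly.
                byte[] bytes = enc.GetBytes(line + "\n");
                stream.Write(bytes, 0, bytes.Length);
            }
            return stream.ToArray();
        }
    }
}
```

If GetBytes included the BOM on every call, the loop above would scatter BOM bytes through the middle of the output.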

Jon Skeet
In general, you should be able to ignore the preamble, since your writer will insert it based on your encoding choice.
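
A small sketch of that, assuming a StreamWriter over a fresh MemoryStream (the helper name is hypothetical): the writer emits the preamble itself when the stream is at position 0, so the caller never touches GetPreamble.

```csharp
using System.IO;
using System.Text;

public static class WriterBomDemo
{
    // Hypothetical helper: lets StreamWriter handle the BOM based on the encoding passed in.
    public static byte[] WriteWithWriter(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
            {
                writer.Write(text);
            } // Disposing the writer flushes it and closes the stream.

            // MemoryStream.ToArray still works after the stream is closed.
            return stream.ToArray();
        }
    }
}
```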
Ishmael
+3  A: 

Because it is expected that GetBytes() will be called lots of times... you need to use:

byte[] preamble = enc.GetPreamble();

(only call it at the start of a sequence) and write that; this is where the BOM lives.

Marc Gravell
+2  A: 

Thank you both. The following works, and LINQ makes the combination simple :-)

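using System.Linq; // Concat and ToArray are LINQ extension methods

UTF8Encoding enc = new UTF8Encoding(true);
byte[] data = enc.GetBytes("a");
// combo is EF BB BF 61: the 3-byte BOM followed by "a"
byte[] combo = enc.GetPreamble().Concat(data).ToArray();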
frou
+2  A: 

Note that in general, you don't need the Byte Order Mark for UTF-8 anyway. Its main purpose is to tell UTF-16 BE and UTF-16 LE apart. There is no such thing as UTF-8 LE and UTF-8 BE.
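
To make the difference concrete, here is a minimal sketch (the helper name is made up) that dumps the preamble bytes of an encoding as hex:

```csharp
using System;
using System.Text;

public static class PreambleBytesDemo
{
    // Hypothetical helper: formats an encoding's preamble as hex, e.g. "EF-BB-BF".
    public static string PreambleHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }
}
```

With the standard encodings, PreambleHex(Encoding.Unicode) gives "FF-FE" (UTF-16 LE) and PreambleHex(Encoding.BigEndianUnicode) gives "FE-FF" (UTF-16 BE), so the two byte orders can be told apart. For new UTF8Encoding(true) it is always "EF-BB-BF", since UTF-8 has only one byte order.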

MSalters
It also allows you to differentiate UTF-8 files from ANSI files.
Ishmael
Even Microsoft admits "ANSI" is a confusing name, even when it's used to describe a charset. "ANSI files" don't exist anyway; on Windows all files are binary. (Mainframes did have true text files, but they didn't have "Microsoft ANSI".)
MSalters