I need to be able to take an arbitrary text input that may have a byte order mark (BOM) on it to mark its encoding, and output it as ASCII. We have some old tools that don't understand BOMs and I need to send them ASCII-only data.

Now, I just got done writing this code and I just can't quite believe the inefficiency here. Four copies of the data, not to mention any intermediate buffers internally in StreamReader. Is there a better way to do this?

// i_fileBytes is an incoming byte[]

string unicodeString = new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd();
byte[] unicodeBytes  = Encoding.Unicode.GetBytes(unicodeString.ToCharArray());
byte[] ansiBytes     = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string ansiString    = Encoding.ASCII.GetString(ansiBytes);

I need the StreamReader() because it has an internal BOM detector to choose the encoding to read the rest of the file. Then the rest is just to make it convert into the final ASCII string.

Is there a better way to do this?

A: 

If you've got i_fileBytes in memory already, you can just check whether or not it starts with a BOM, and then convert either the whole of it or just the bit after the BOM using Encoding.Unicode.GetString. (Use the overload which lets you specify an index and length.)

So as code:

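// Skip the UTF-16 little-endian BOM (FF FE) if present; the length check guards against inputs shorter than two bytes.
int start = (i_fileBytes.Length >= 2 && i_fileBytes[0] == 0xff && i_fileBytes[1] == 0xfe) ? 2 : 0;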
string text = Encoding.Unicode.GetString(i_fileBytes, start, i_fileBytes.Length-start);

Note that this assumes the data really is little-endian UTF-16, however. If you need to detect the encoding first, you could either reimplement what StreamReader does, or perhaps just build a StreamReader from the first (say) 10 bytes, and use the CurrentEncoding property to work out what you should use for the encoding.

EDIT: Now, as for the conversion to ASCII - if you really only need it as a .NET string, then presumably all you want to do is replace any non-ASCII characters with "?" or something similar. (Alternatively it might be better to throw an exception... that's up to you, of course.)
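As a rough sketch of those two options (not necessarily what you'd want in production): Encoding.ASCII already substitutes "?" for anything it can't represent, and Encoding.GetEncoding has an overload that takes explicit fallbacks, so passing EncoderFallback.ExceptionFallback gives you the throwing variant. The class and method names here (AsciiConversion, ToAsciiLossy, ToAsciiStrict) are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class AsciiConversion
{
    // Lossy: Encoding.ASCII replaces each char outside 0x00-0x7F with '?'.
    public static string ToAsciiLossy(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    // Strict: throws EncoderFallbackException on the first non-ASCII char.
    public static string ToAsciiStrict(string text)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        return strictAscii.GetString(strictAscii.GetBytes(text));
    }
}
```

If the old tools only need the bytes, you can stop after GetBytes and skip the final GetString.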

EDIT: Note that when detecting the encoding, it would be a good idea to just call Read() a single time to read one character. Don't call ReadToEnd() as by picking 10 bytes as an arbitrary amount of data, it might end mid-character. I don't know offhand whether that would throw an exception, but it has no benefits anyway...
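Here's a minimal sketch of that detection idea, assuming the whole input is already in a byte[] as in the question. BomSniffer, DecodeWithBom and StartsWith are hypothetical names, and 10 bytes is just the arbitrary probe size mentioned above. One wrinkle: when no BOM is found, StreamReader falls back to UTF-8, whose GetPreamble() still returns the UTF-8 BOM, so the code checks that the data really starts with the preamble before skipping it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class BomSniffer
{
    public static string DecodeWithBom(byte[] data)
    {
        Encoding encoding;
        int probeLength = Math.Min(data.Length, 10);

        // Let StreamReader look at the first few bytes only.
        using (var reader = new StreamReader(new MemoryStream(data, 0, probeLength)))
        {
            reader.Read();                     // a single Read() triggers BOM detection
            encoding = reader.CurrentEncoding; // UTF-8 if no BOM was recognised
        }

        // Skip the BOM only if the data actually begins with it.
        byte[] preamble = encoding.GetPreamble();
        int start = StartsWith(data, preamble) ? preamble.Length : 0;
        return encoding.GetString(data, start, data.Length - start);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

This decodes the original array once, so the only extra work is the small probe stream.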

Jon Skeet
Yeah, this is what I was considering and wanting to avoid. I can use Reflector to extract the BOM detection stuff from StreamReader. Not very clean and future-proof though. Using StreamReader to just grab the first 10 bytes is interesting though. Good idea!
Scott Bilas
A: 
System.Text.Encoding.ASCII.GetBytes(new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd())

That should save a few round-trips.

Joshua