I'm going to be reading and parsing the EML files dropped by the Microsoft SMTP service. I am a newbie to using the various stream classes. The implementation I have seen that parses these files uses a variation on System.IO.Stream to read byte by byte. However, it seems like these files should never be anything but text. Wouldn't it be better to use a StreamReader? And if so, is there any reason to use something other than the default (UTF-8) encoding?

+1  A: 

They should be text, but they aren't always.

Email bodies can use the 8bit or binary Content-Transfer-Encoding, so bytes outside 7-bit ASCII can show up in the file.

A StreamReader will work for about 99% of the emails you want to parse.

However, quite honestly, that's not the biggest problem.

The harder problem is actually parsing and extracting the MIME content according to the MIME rules, and using the correct character set while you do it.

Although UTF-8 can represent a very large range of characters and will decode the majority of emails, you can still get corrupt content if you use it on an email that was written in a different character set.
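To illustrate the kind of corruption meant here, a minimal sketch in Python (used only as a neutral illustration; the sample word and the ISO-8859-1 charset are assumptions, not taken from a real message). I would expect a .NET StreamReader left on its default encoding to behave much the same way, but treat that as an assumption to verify.

```python
# Bytes as they might sit in an 8bit body whose headers declare ISO-8859-1.
# The sample text is hypothetical.
raw = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

# Wrong charset: 0xE9 is not a complete UTF-8 sequence on its own, so the
# decoder substitutes U+FFFD (the replacement character) and the letter is lost.
wrong = raw.decode("utf-8", errors="replace")

# Declared charset: the same byte decodes to the intended letter.
right = raw.decode("iso-8859-1")

print(wrong == right)   # the two decodings differ
```

Without `errors="replace"` the UTF-8 decode raises an exception instead of substituting, which is arguably the safer failure mode when you are guessing at the encoding.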

The best way to do this is to read the email in binary form, extract the character set from the headers, then switch to reading the email using that character set.
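A rough sketch of that two-pass approach, in Python as a neutral illustration. The function names `detect_charset` and `read_eml_text` are made up for this example, and it only looks at the top-level Content-Type header. A real multipart message can declare a different charset per part, so this is a simplification, not a full MIME parser. In .NET the analogous step would presumably be to pass the detected name to `Encoding.GetEncoding` and hand the result to a `StreamReader`.

```python
import codecs
import re

# Matches charset=value or charset="value" inside a Content-Type header line.
CHARSET_RE = re.compile(rb'charset\s*=\s*"?([A-Za-z0-9._:\-]+)"?', re.IGNORECASE)


def detect_charset(raw: bytes, default: str = "us-ascii") -> str:
    """Return the charset named in the top-level headers, or the default."""
    # Headers end at the first blank line (bare LF is tolerated here).
    end = re.search(rb"\r?\n\r?\n", raw)
    header_block = raw[:end.start()] if end else raw
    # Unfold continuation lines so a charset on a folded line is still found.
    header_block = re.sub(rb"\r?\n[ \t]+", b" ", header_block)
    for line in header_block.splitlines():
        if line.lower().startswith(b"content-type:"):
            found = CHARSET_RE.search(line)
            if found:
                name = found.group(1).decode("ascii")
                try:
                    codecs.lookup(name)   # is this a charset Python knows?
                    return name
                except LookupError:
                    return default
    return default


def read_eml_text(raw: bytes) -> str:
    """Second pass: decode the whole message with the detected charset."""
    return raw.decode(detect_charset(raw), errors="replace")
```

Usage would be something like `read_eml_text(open(path, "rb").read())`, where `path` is whatever file the SMTP service dropped. Falling back to us-ascii when no charset is declared follows the MIME default, and anything outside it is replaced rather than silently misread.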

Cheers!

dave

dave wanta
Thanks for the answer. I see the charset parameter of the "Content-Type" header in RFC 2046 sec. 4.1.2. Is this the correct piece of information?
Chris Simmons
Yes. One thing you can do is scan ahead for that value, set the encoding from it, and then re-read the email with a StreamReader created with that charset.
dave wanta
Sounds like a plan. Thanks again.
Chris Simmons