This is an example raw email I am trying to parse:

MIME-version: 1.0
Content-type: text/html; charset=UTF-8
Content-transfer-encoding: quoted-printable
X-Mailer: Verizon Webmail
X-Originating-IP: [x.x.x.x]

=C2=A0test testing testing 123

What is =C2=A0? I have tried a half dozen quoted-printable parsers, but none handle this correctly. How would one properly parse this in C#?

Honestly, for now, I'm coding:

//TODO WTF
encoded = encoded.Replace("=C2=A0", "");

Because I can't figure out why that text is there randomly within the MIME content, and isn't supposed to be rendered into anything. By just removing it, I'm getting the desired effect - but WHY?!

To be clear, I know that (=[0-9A-F]{2}) is an encoded character. But in this case, it seemingly represents NOTHING.

+5  A: 

"=C2=A0" represents the bytes C2 A0. Since the charset is UTF-8, those two bytes decode to U+00A0, the Unicode code point for a non-breaking space. See http://home.tiscali.nl/t876506/utf8tbl.html

Steven Sudit
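A quick way to check the claim in this answer is to hand the two bytes to the framework's UTF-8 decoder and look at what comes back. This is only a small sketch; the class name NbspDemo is made up for illustration.

```csharp
using System;
using System.Text;

class NbspDemo
{
    static void Main()
    {
        // The two bytes that "=C2=A0" stands for after quoted-printable decoding.
        byte[] raw = { 0xC2, 0xA0 };

        // Decode them together as UTF-8; decoding each byte on its own
        // (char)0xC2, (char)0xA0 would give two wrong characters instead.
        string s = Encoding.UTF8.GetString(raw);

        Console.WriteLine(s.Length);                    // prints 1
        Console.WriteLine(((int)s[0]).ToString("X4"));  // prints 00A0
    }
}
```

So the sequence is not "nothing": it is a single non-breaking space, which many renderers show as an ordinary-looking blank.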
What is the way to parse this in C#? All of the parsers I've tried operate on each char independently, and do this: int iHex = Convert.ToInt32(hex, 16); char c = (char)iHex;
TheSoftwareJedi
Does UTF-8 always encode in 2 bytes like this? Can I assume a match of (=[0-9A-F]{2}=[0-9A-F]{2}) instead of the single byte? Why the hell isn't there a parser for this?!?!?!?!
TheSoftwareJedi
If you read up on UTF-8, you'll see that any code point above 7F has to be encoded as a multi-byte sequence (two bytes up to U+07FF, three or four bytes beyond that), and every byte in that sequence has its high bit set. So, yes, A0 is always encoded as C2 A0, but you can't assume everything comes in pairs, and you can't go byte-by-byte. The right way to handle UTF-8 with quoted-printable is to first decode the quoted-printable part into raw bytes and then decode those bytes as UTF-8, resulting in a string of 16-bit code units (UTF-16, which is what a .NET string is).
Steven Sudit
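The two-step approach described in the comment above (quoted-printable to bytes first, then bytes to text using the declared charset) could be sketched roughly as follows. QuotedPrintableSketch and Decode are hypothetical names, not a .NET API, and the sketch assumes the body has already been separated from the headers and that the charset is known from the Content-type header. It handles =XX escapes and soft line breaks only; it is not a complete MIME parser.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of the .NET framework.
static class QuotedPrintableSketch
{
    public static string Decode(string input, Encoding charset)
    {
        using (var bytes = new MemoryStream())
        {
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                if (c != '=')
                {
                    // A quoted-printable body should be plain ASCII, so one char maps to one byte.
                    bytes.WriteByte((byte)c);
                    continue;
                }

                // Soft line break: '=' at the end of a line is dropped along with the newline.
                if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n')
                {
                    i += 2;
                    continue;
                }
                if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i += 1;
                    continue;
                }

                // "=XX": two hex digits standing for one raw byte (not one character).
                if (i + 2 < input.Length && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    bytes.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 2;
                    continue;
                }

                // Malformed escape: keep the '=' literally rather than fail.
                bytes.WriteByte((byte)'=');
            }

            // Only now are the collected bytes interpreted as text in the declared charset.
            return charset.GetString(bytes.ToArray());
        }
    }

    static void Main()
    {
        string decoded = Decode("=C2=A0test testing testing 123", Encoding.UTF8);
        Console.WriteLine(decoded[0] == '\u00A0');  // prints True
        Console.WriteLine(decoded.Substring(1));    // prints test testing testing 123
    }
}
```

Collecting bytes before decoding is what makes this independent of sequence length: a three-byte sequence such as =E2=82=AC (the euro sign) goes through the same path, which a regex matching pairs of escapes would not handle.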
Let me also add that I've used Chilkat's S/MIME control to parse email messages for me, and it does a really good job. It's also quite cheap.
Steven Sudit
Thanks Steven. I'll go ahead and purchase that because I'm sick of hacking this crap together. :)
TheSoftwareJedi
Actually, I *love* writing MIME parsers and such, but I simply can't justify spending days to produce something with a fraction of the functionality of a cheap, reliable third-party control. Even if I were paid minimum wage, it would not be cost-effective.
Steven Sudit