UTF8 (Quoted Printable) conversion in C# question

tags:

c#
utf-8

+2 Q:

Hello,

I am pulling French emails from a mailbox and the emails contain accents. I believe it is using UTF8 encoding.

I have tried different UTF8 conversion methods I've found around the Internet but have been unsuccessful.

How, for example, in C#, do I convert this: Montr=C3=A9al to Montréal?

Edit: Also, it is inconsistent. Sometimes it may be like Montr& eacute;al. (The space after the ampersand is just added so the browser does not convert it.)

Thanks!! Mark

+1 A:

That's not UTF-8. That's quoted printable, which isn't quite the same sort of encoding as UTF-8 - it's more a "binary data to ASCII text" encoding.

Quoted printable will effectively allow you to convert the ASCII message into a byte array which can then be decoded as UTF-8.

I'm not sure whether there's any direct support in .NET for quoted printable encoding, which is somewhat bizarre... I may well have missed something.

Jon Skeet 2010-07-20 12:42:31

I don't think that is quite it. When I convert from QP I get this: MontrÃ©al. It is treating =C3 and =A9 as two separate characters. However, they need to be interpreted together to get the é.

2010-07-20 13:33:09

@user390480: That suggests that you're not converting from QP properly. You need to convert from QP to binary, and *then* use UTF-8 encoding to decode that binary to text.

Jon Skeet 2010-07-20 14:12:53
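
To illustrate the point in the comment above, here is a minimal sketch of why "Ã©" shows up. It assumes the faulty decoder turned each byte into its own character, which is what ISO-8859-1 does; the class and method names (MojibakeDemo, DecodePerByte, DecodeAsUtf8) are made up for this illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // The two bytes that the quoted-printable text "=C3=A9" stands for.
    static readonly byte[] Bytes = { 0xC3, 0xA9 };

    // Wrong: one character per byte (ISO-8859-1 maps byte n to code point n).
    public static string DecodePerByte()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Bytes);
    }

    // Right: the byte sequence as a whole is decoded as UTF-8.
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(Bytes);
    }
}
```

DecodePerByte() yields the two characters U+00C3 and U+00A9 ("Ã©"), while DecodeAsUtf8() yields the single character U+00E9 ("é").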

The UTF-8 encoding translates an array of bytes (8-bit numbers) to a string (or vice versa), i.e. it is a mapping between "numbers" and "characters". The set of characters it covers is larger than the set of ASCII characters: for example, é can be encoded in UTF-8 but is not part of ASCII.

Quoted-printable encoding translates an array of bytes (8-bit numbers) to a sequence of ASCII characters (actually a subset of them).

Thus, combining both, you can "encode" a UTF-8 string as a sequence of characters from a subset of ASCII (an ASCII string).
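
As a rough sketch of that combination in C#: the text is first turned into UTF-8 bytes, and each byte is then written either as itself or as =XX. The helper name QuotedPrintableEncodeSketch.Encode is hypothetical, and the code is deliberately simplified (no 76-character soft line breaks, no special handling of trailing whitespace), so it is an illustration rather than a complete encoder.

```csharp
using System;
using System.Text;

static class QuotedPrintableEncodeSketch
{
    // Hypothetical helper: string -> UTF-8 bytes -> ASCII-only text.
    public static string Encode(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            // Printable ASCII (and space) is kept as-is, except '=' itself,
            // which must be escaped because it introduces an escape sequence.
            bool literal = (b >= 32 && b <= 126) && b != (byte)'=';
            if (literal)
                sb.Append((char)b);
            else
                sb.Append('=').Append(b.ToString("X2")); // e.g. 0xC3 -> "=C3"
        }
        return sb.ToString();
    }
}
```

With this sketch, Encode("Montréal") gives "Montr=C3=A9al", the form shown in the question.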

The same can be done with other encodings (e.g. ISO-8859-1). Thus you need two pieces of information:

- The given ASCII string is quoted-printable.
- The resulting byte array represents a string encoded as UTF-8.

Decoding quoted-printable thus has two steps:

1. Create the byte array, say bytes[], via the quoted-printable rules, i.e.
   - The substring =NM maps to the byte with hexadecimal value NM (N*16 + M).
   - Any other character maps to its ASCII byte. (Note that the similar Q-encoding used in encoded-words additionally maps _ to space.)
2. Then interpret the byte array as a UTF-8 string.
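
The two steps above could be sketched in C# roughly as follows. The class and method names (QuotedPrintableSketch.Decode) are invented for this example, and the charset is passed in on the assumption that the caller knows it (for instance from the mail's Content-Type header). Besides the two rules listed, the sketch also drops soft line breaks (an = at the end of a line), which quoted-printable uses to wrap long lines.

```csharp
using System;
using System.IO;
using System.Text;

static class QuotedPrintableSketch
{
    // Hypothetical helper: quoted-printable ASCII text -> bytes -> string.
    // Assumes the input contains only ASCII characters.
    public static string Decode(string input, Encoding charset)
    {
        using (var bytes = new MemoryStream())
        {
            int i = 0;
            while (i < input.Length)
            {
                char c = input[i];
                if (c == '=' && i + 2 < input.Length
                    && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // Step 1a: "=NM" becomes the single byte 0xNM.
                    bytes.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 3;
                }
                else if (c == '=' && i + 2 < input.Length
                         && input[i + 1] == '\r' && input[i + 2] == '\n')
                {
                    i += 3; // soft line break "=\r\n": produces no output
                }
                else if (c == '=' && i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i += 2; // soft line break with a bare "\n"
                }
                else
                {
                    // Step 1b: any other character maps to its ASCII byte.
                    bytes.WriteByte((byte)c);
                    i++;
                }
            }
            // Step 2: interpret the whole byte array in the given charset.
            return charset.GetString(bytes.ToArray());
        }
    }
}
```

Called as QuotedPrintableSketch.Decode("Montr=C3=A9al", Encoding.UTF8), this returns "Montréal". Collecting all bytes first and decoding once at the end is what keeps =C3 and =A9 together as one character.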

Christian Fries 2010-09-17 09:08:59
