Hello,

I am pulling French emails from a mailbox and the emails contain accents. I believe it is using UTF8 encoding.

I have tried different UTF8 conversion methods I've found around the Internet but have been unsuccessful.

How, for example, in C#, do I convert this: Montr=C3=A9al to Montréal?

Edit: Also, it is inconsistent. Sometimes it may be like Montr& eacute;al. (The space after the ampersand is just added so the browser does not convert it.)

Thanks!! Mark

+1  A: 

That's not UTF-8. That's quoted-printable, which isn't quite the same sort of encoding as UTF-8 - it's more a "binary data to ASCII text" encoding.

Quoted printable will effectively allow you to convert the ASCII message into a byte array which can then be decoded as UTF-8.

I'm not sure whether there's any direct support in .NET for quoted printable encoding, which is somewhat bizarre... I may well have missed something.

Jon Skeet
I don't think that is quite it. When I convert from QP I get this: Montréal. It is taking each =C3 and =A9 as a separate entity. However, they need to be interpreted together to get the é.
@user390480: That suggests that you're not converting from QP properly. You need to convert from QP to binary, and *then* use UTF-8 encoding to decode that binary to text.
Jon Skeet
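To make the comment above concrete, here is a minimal sketch of the two-step decode. It is written in Python only because its standard library ships a quoted-printable decoder (`quopri`); it is not the C# the question asks for, and the variable names are illustrative. The same two steps would apply in C#, where the second step would be something like `Encoding.UTF8.GetString(bytes)`.

```python
import quopri

qp_text = b"Montr=C3=A9al"

# Step 1: quoted-printable -> raw bytes. =C3 and =A9 become two separate bytes.
raw = quopri.decodestring(qp_text)   # b'Montr\xc3\xa9al'

# Step 2: raw bytes -> text. The two bytes together form one UTF-8 character.
right = raw.decode("utf-8")          # 'Montréal'

# Treating each byte as its own character (e.g. Latin-1) reproduces
# the garbled result reported in the comment above.
wrong = raw.decode("latin-1")        # 'MontrÃ©al'
```

The garbled result appears when each decoded byte is turned into a character on its own instead of the byte array being handed to a UTF-8 decoder as a whole.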
A: 

The UTF-8 encoding translates an array of bytes (8-bit numbers) to a string (or vice versa), i.e. it is a mapping between "numbers" and "characters". The set of characters it covers is larger than the set of ASCII characters: for example, é can be encoded in UTF-8 but is not part of ASCII.

Quoted-printable encoding translates an array of bytes (8-bit numbers) to a sequence of ASCII characters (actually a subset of them).

Thus, combining both, you can "encode" a UTF-8 string into a sequence drawn from a subset of the ASCII characters (an ASCII string).
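As a rough illustration of the encoding direction, the sketch below chains the two mappings using Python's standard `quopri` module. Python is used here only for brevity, and the variable names are made up for the example.

```python
import quopri

text = "Montréal"

# Character -> bytes: é becomes the two bytes 0xC3 0xA9 in UTF-8.
utf8_bytes = text.encode("utf-8")             # b'Montr\xc3\xa9al'

# Bytes -> ASCII: each non-ASCII byte is written as =NM in hexadecimal.
ascii_form = quopri.encodestring(utf8_bytes)  # b'Montr=C3=A9al'
```

This is how a single é ends up as the six ASCII characters =C3=A9 in the mail body.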

The same can be done with other encodings (e.g. ISO-8859-1). Thus you need two pieces of information:

  • The given ASCII string is quoted printable.
  • The resulting byte array represents a string encoded in UTF-8.
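In an email, both pieces of information normally travel in the message headers (Content-Transfer-Encoding and the charset parameter of Content-Type). The following is a sketch, assuming a well-formed single-part message, of reading them with Python's standard `email` package. The sample message is invented for the example.

```python
from email import message_from_bytes

# A made-up minimal message; real mail carries many more headers.
raw_message = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Montr=C3=A9al\r\n"
)

msg = message_from_bytes(raw_message)

# Information 1: how the bytes were turned into ASCII.
transfer_encoding = msg["Content-Transfer-Encoding"]

# Information 2: which character encoding the bytes represent.
charset = msg.get_content_charset()

# decode=True undoes the transfer encoding and returns raw bytes.
payload_bytes = msg.get_payload(decode=True)

# Falling back to UTF-8 when no charset is declared is an assumption.
text = payload_bytes.decode(charset or "utf-8")
```

If the headers declared ISO-8859-1 instead, the same code would decode the bytes with that charset, which is why guessing UTF-8 for every message is fragile.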

Decoding quoted-printable thus has two steps:

  1. Create the byte array, say bytes[], via the quoted-printable rules, i.e.

    • The substring =NM maps to the byte with hexadecimal value NM (that is, N*16 + M).
    • Any other character maps to its ASCII byte. (Note that the similar Q encoding used for encoded-words in headers additionally maps _ to a space.)
  2. Then interpret the byte array as a UTF-8 string.
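The two steps above can be sketched by hand as follows. This is a minimal illustration in Python, not a complete decoder, and the function names `qp_to_bytes` and `decode_qp_utf8` are hypothetical. Beyond the two rules listed, it also skips soft line breaks (an = at the end of a line), which the quoted-printable format uses to wrap long lines. Leaving a malformed = in place as a literal character is an assumption of this sketch.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"

def qp_to_bytes(ascii_text: str) -> bytes:
    """Step 1: apply the quoted-printable rules to get a byte array."""
    out = bytearray()
    i, n = 0, len(ascii_text)
    while i < n:
        ch = ascii_text[i]
        if ch == "=":
            pair = ascii_text[i + 1:i + 3]
            # Rule: =NM maps to the byte N*16 + M.
            if len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
                out.append(int(pair, 16))
                i += 3
                continue
            # Soft line break: "=" directly before a line ending is dropped.
            if ascii_text.startswith("=\r\n", i):
                i += 3
                continue
            if ascii_text.startswith("=\n", i):
                i += 2
                continue
        # Rule: any other character maps to its ASCII byte.
        # (Raises UnicodeEncodeError if the input is not pure ASCII.)
        out += ch.encode("ascii")
        i += 1
    return bytes(out)

def decode_qp_utf8(ascii_text: str) -> str:
    """Step 2: interpret the resulting bytes as a UTF-8 string."""
    return qp_to_bytes(ascii_text).decode("utf-8")

city = decode_qp_utf8("Montr=C3=A9al")
```

Here `city` holds the string Montréal. For a message in another charset, only the final `decode` call would change; the byte-building step stays the same.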

Christian Fries