Certain mail clients allow for the sender to place images directly in the body of their email (instead of as a traditional attachment). When I receive one of these emails in my application, I need to be able to look at only the text/plain message body and determine that the sender embedded an inline image.

I'm trying to craft a RegEx to find image placeholders in the text/plain message body so I can swap them for <img> tags in my own HTML-enabled version of the message. (Wacky, I know, but this is the requirement).

The problem I'm finding is that the placeholders differ based on the sending mail client. For example, when sent from MS Outlook, the text/plain body of the multi-part message looks like this:

Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Check out this image:

[cid:[email protected]]

Isn't it cool??

A similar message sent from Gmail is a little bit different:

Content-Type: text/plain; charset=ISO-8859-1

Check out this image:

[image: image001.jpg]

Isn't it cool??

The text/html body and image/jpeg part with the base64 encoded image follow.

Has anyone done any research on this before and compiled a list or built a RegEx specifically for this purpose?

I realize a more reliable way to achieve my goal is to look at the text/html portion of the message--which seems to be a bit more standardized from the few tests I've done--but unfortunately I don't have access to that in this scenario.

I'm using C#, if that matters to anyone.

Here's a list of text/plain image placeholders I've compiled thus far:

  • Gmail: [image: filename.jpg]
  • Outlook 2007: [cid:[email protected]]
  • Thunderbird 3.0.7: none
A:

I'd suggest going with the HTML part. If you only want to find a placeholder in the plain-text part, this very simple regular expression should be sufficient (PCRE):

^\[.*\]$

At least, this works for the examples above (in multi-line mode, so that ^ and $ match at line boundaries). Mind that it will catch every line that starts with [ and ends with ], no matter what the contents are. Identifying the image name would require a somewhat more complicated expression. If you'd like to limit the regexp to certain file types, try this:

/^\[.*(\.jpg|\.jpeg|\.png|\.gif|\.bmp).*\]$/i

Examples will work in Perl, since you didn't mention language...
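
As a rough illustration of how the two patterns above behave on a whole message body, here is a minimal sketch in Python (chosen only for brevity; the same patterns should carry over to .NET's Regex class with the Multiline and IgnoreCase options). The names ANY_BRACKETED, IMAGE_BRACKETED and find_placeholders are hypothetical, and the sample body is the Gmail example from the question.

```python
import re

# The two patterns from the answer. re.MULTILINE lets ^ and $ match at the
# start and end of every line in the body, not just of the whole string.
ANY_BRACKETED = re.compile(r"^\[.*\]$", re.MULTILINE)
IMAGE_BRACKETED = re.compile(
    r"^\[.*(\.jpg|\.jpeg|\.png|\.gif|\.bmp).*\]$",
    re.MULTILINE | re.IGNORECASE,
)


def find_placeholders(body, pattern=IMAGE_BRACKETED):
    """Return every whole line of `body` that matches `pattern`."""
    # Mail bodies usually use CRLF line endings; normalize them so that a
    # trailing \r does not stop $ from matching.
    body = body.replace("\r\n", "\n")
    return [m.group(0) for m in pattern.finditer(body)]


gmail_body = "Check out this image:\n\n[image: image001.jpg]\n\nIsn't it cool??\n"
print(find_placeholders(gmail_body))
```

The broader ANY_BRACKETED pattern also matches bracketed lines that have nothing to do with images, such as a line containing only "[note]", which is why the file-type version is the default here.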

Paweł Dyda
Thanks! Unfortunately, I can't use the HTML part. I think it's pretty safe to look for image file formats within [ ]. Good idea. The big problem is figuring out what each client does, so I know how to build the RegEx. I wish Gmail included the "cid:" part. I'll test a few more email clients, too.
Rob Sobers
Well, these "inline" images are not standardized, so there is no way to tell what they will look like in various clients. However, MIME does standardize part headers, so you could get the image name from its header and then look for this text in order to replace it.
Paweł Dyda
@Pawel: you're right. Unfortunately, all I've got access to is the plain text body. But I just realized, I also know the entire filename, so I can tighten up the RegEx by building it on the fly and matching the filename, e.g.: `^\[.*test\.jpg.*\]$`
Rob Sobers
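
The idea from the last comment, building the pattern on the fly from a filename that is already known, might look roughly like the sketch below (Python again, for brevity). The helper names placeholder_pattern and swap_placeholder are hypothetical, as is the src value passed in; how the image is actually referenced depends on the application. The filename is escaped so that the dot in "test.jpg" only matches a literal dot.

```python
import html
import re


def placeholder_pattern(filename):
    """Compile a pattern for a bracketed line that mentions `filename`."""
    # re.escape makes the dot (and any other metacharacter) in the
    # filename literal, equivalent to writing test\.jpg by hand.
    return re.compile(
        r"^\[.*" + re.escape(filename) + r".*\]$",
        re.MULTILINE | re.IGNORECASE,
    )


def swap_placeholder(body, filename, src):
    """Replace each placeholder line for `filename` with an <img> tag."""
    tag = '<img src="{}" alt="{}">'.format(
        html.escape(src, quote=True), html.escape(filename, quote=True)
    )
    # Passing a function as the replacement avoids backslash processing
    # of the tag text by re.sub.
    return placeholder_pattern(filename).sub(lambda m: tag, body)


body = "Check out this image:\n\n[image: test.jpg]\n\nIsn't it cool??\n"
print(swap_placeholder(body, "test.jpg", "cid:test.jpg"))
```

Because the pattern is anchored to a whole line, a mention of the filename in running text is left alone; only a line consisting of a bracketed placeholder is swapped.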