The reason I want to do this is to make it easy to parse out instructions that are emailed to a bot, the kind of thing majordomo might do to parse commands like subscribing and unsubscribing. It turns out there are a lot of crazy formats and things to deal with, like quoted text, distinguishing between header and body, etc.

A Perl module to do this would be ideal, but solutions in any language are welcome.

+2  A: 

I can't say I have ever done exactly what you are talking about, but you might give this a read, as it sounds like the author is doing what you describe.

Parsing MIME & HTML

Shane
Thanks, very helpful indeed. Note missing final "l" in the URL.
dreeves
A: 

Some ideas: http://news.ycombinator.com/item?id=666607

Here's my incomplete solution, which actually works for my purposes (parsing commands emailed to a bot). I'm keeping it here for reference until there's a definitively better answer.

# Take an email as a big string and turn it into a plain ASCII equivalent.
# TODO: leave any HTML tags inside of quotes alone.
sub plainify {
  my ($email) = @_;

  # Decode the handful of quoted-printable escapes seen so far.
  # Soft line breaks ("=" at end of line) go first; the longer line
  # endings are listed first so "=\r\n" is removed as a whole.
  $email =~ s/=(?:\r\n|\n\r|\r|\n)//g;
  $email =~ s/=0D=0A/\n/g;
  $email =~ s/=0A/\n/g;
  $email =~ s/=A0/ /g;
  $email =~ s/=2E/./g;
  $email =~ s/=20/ /g;

  # Translate HTML to plain text (or enough of it to parse commands).
  $email =~ s/&nbsp;/ /g;
  $email =~ s/<br\s*\/?>/\n/gi;       # also matches <br/> and <br />
  $email =~ s/(<[^>]+>)/\n$1\n/g;     # put each remaining tag on its own line

  return $email;
}
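
The substitutions above only cover a few quoted-printable escapes. For comparison, here is a minimal sketch in Python (not Perl) that decodes the whole encoding with the standard library's quopri module. The function name decode_qp and the latin-1 default are assumptions for illustration; a real message declares its charset in the Content-Type header.

```python
import quopri


def decode_qp(text, charset="latin-1"):
    """Decode a quoted-printable string, including soft line breaks.

    `charset` is an assumed default; real mail states it in Content-Type.
    """
    raw = quopri.decodestring(text.encode("ascii", "replace"))
    return raw.decode(charset, "replace")


if __name__ == "__main__":
    print(decode_qp("subscribe=20my-list=2E"))
```

This handles any =XX escape rather than a fixed list, and it joins lines that end in a soft break ("=" followed by a newline).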
dreeves
A: 

You could do worse than look at CPAN for email-related modules.

One that I've used in the past for breaking out subjects and bodies is Email::Simple.

Steve Kemp
+3  A: 

Python has the email module in its standard library.

>>> import email.parser  # Python 2 spelled this email.Parser
>>> p = email.parser.Parser()
>>> msg = p.parsestr("From: [email protected]\nSubject: Hello\n\nDear Sir or Madam...")
>>> msg.get("Subject")
'Hello'
>>> msg.get_payload()
'Dear Sir or Madam...'

It supports MIME and pretty much all encodings that are included in Python. HTML will just be text to it, but you can use BeautifulSoup or Tidy+ElementTree to get the text out of it.
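
As a rough sketch of how that could fit together for a command bot: the code below picks the text/plain part if there is one, falls back to the HTML part, and drops quoted lines. It uses the standard library's html.parser in place of BeautifulSoup to stay dependency-free, and it assumes Python 3.6 or later. The helper names (html_to_text, body_text, command_lines) and the "lines starting with > are quotes" rule are assumptions, not part of the email module.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, treating <br>, <p> and <div> as line breaks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "div"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Very rough HTML-to-text conversion, enough to find command lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def body_text(raw):
    """Return the message body as text, preferring text/plain over HTML."""
    msg = email.message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    text = part.get_content()  # transfer encoding and charset are decoded here
    if part.get_content_type() == "text/html":
        text = html_to_text(text)
    return text


def command_lines(raw):
    """Non-empty body lines that are not quoted (assumed to start with '>')."""
    lines = (line.strip() for line in body_text(raw).splitlines())
    return [line for line in lines if line and not line.startswith(">")]


if __name__ == "__main__":
    sample = "From: a@example.com\nSubject: x\n\nsubscribe list\n> old text\n"
    print(command_lines(sample))
```

Signature blocks and "On ... wrote:" attribution lines are not handled here and would need their own rules.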

Torsten Marek