i'm looking for good/working/simple to use php code for parsing raw email into parts.

i've written a couple of brute force solutions, but every time one small change/header/space/something comes along, my whole parser fails and the project falls apart.

and before i get pointed at PEAR/PECL, i need actual code. my host has some screwy config or something, and i can never seem to get the .so's to build right. if i do get the .so made, some difference in path/environment/php.ini means it isn't always available (apache vs cron vs cli).

oh, and one last thing, i'm parsing the raw email text, NOT pop3, and NOT imap. it's being piped into the php script via a .qmail email redirect.

i'm not expecting SOF to write it for me, i'm looking for some tips/starting points on doing it "right". this is one of those "wheel" problems that i know has already been solved.

+5  A: 

What are you hoping to end up with at the end? The body, the subject, the sender, an attachment? You should spend some time with RFC2822 to understand the format of the mail, but here are the simplest rules for well-formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
  INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)
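
The rules above can be sketched in a few lines of plain PHP with no extensions. This is a minimal sketch, not a complete RFC 2822 parser: the function name `parse_raw_email` and the shape of the returned array are illustrative assumptions, header names are lowercased, and each header maps to a list of values because headers such as Received can repeat.

```php
<?php
// Minimal sketch: split a raw message into headers and body.
// parse_raw_email() is a hypothetical name, not part of any library.
function parse_raw_email(string $raw): array
{
    // Normalise CRLF and bare CR to LF so the blank-line test is uniform.
    $raw = str_replace(["\r\n", "\r"], "\n", $raw);

    // The first blank line separates the headers from the body.
    $parts = explode("\n\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    // Unfold continuations: a newline followed by whitespace continues the
    // previous header. Collapsing the fold to one space is a simplification.
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock);

    $headers = [];
    foreach (explode("\n", $headerBlock) as $line) {
        // A header name starts the line and contains no whitespace or colon.
        if (!preg_match('/^([^\s:]+):[ \t]*(.*)$/', $line, $m)) {
            continue; // not a header line, e.g. an mbox "From " separator
        }
        $headers[strtolower($m[1])][] = rtrim($m[2]);
    }

    return ['headers' => $headers, 'body' => $body];
}
```

When the mail is piped in, something like `parse_raw_email(file_get_contents('php://stdin'))` would feed it the raw text. The body comes back untouched, so any MIME decoding still has to happen afterwards.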

Most modern email is more complex than that though. Headers can be encoded for charsets as RFC2047 MIME encoded-words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days too, if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be base64-encoded text, it might be HTML, it might be a base64-encoded Excel spreadsheet.
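
As an illustration of the header encoding mentioned above, here is a rough sketch of decoding RFC2047 encoded-words (the `=?charset?B?...?=` and `=?charset?Q?...?=` forms) using only core PHP functions. The function name `decode_encoded_words` is made up for this example, and the sketch assumes you are happy to get back the bytes in whatever charset the header declared; it does no charset conversion.

```php
<?php
// Sketch: decode RFC 2047 encoded-words in a header value.
// decode_encoded_words() is a hypothetical helper name.
// Assumption: the decoded bytes are returned as-is, without charset conversion.
function decode_encoded_words(string $value): string
{
    // Whitespace between two adjacent encoded-words is not part of the text.
    $value = preg_replace('/(\?=)\s+(=\?)/', '$1$2', $value);

    return preg_replace_callback(
        '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
        function (array $m): string {
            if (strtoupper($m[2]) === 'B') {
                // "B" means base64; leave the word untouched if it is invalid.
                $decoded = base64_decode($m[3], true);
                return $decoded === false ? $m[0] : $decoded;
            }
            // "Q" is quoted-printable with "_" standing in for a space.
            return quoted_printable_decode(str_replace('_', ' ', $m[3]));
        },
        $value
    );
}
```

PHP also ships `iconv_mime_decode()` and `mb_decode_mimeheader()`, which handle charset conversion as well, but they depend on the iconv and mbstring extensions being available in every environment the script runs in.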

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.

jj33
A: 

yeah, i've been able to write a basic parser, based off that rfc and some other basic tutorials. but it's the multipart mime nested boundaries that keep messing me up.

i found out that MMS (not SMS) messages sent from my phone are just standard emails, so i have a system that reads the incoming email, checks the from (to only allow from my phone), and uses the body part to run different commands on my server. its sort of like a remote control by email.

because the system is designed to send pictures, it's got a bunch of differently encoded parts: a mms.smil.txt part, a text/plain part (which is useless, it just says 'this is a html message'), an application/smil part (which is the part that phones would pick up on), a text/html part with an advertisement for my carrier followed by my message, all wrapped in html, and finally a textfile attachment with my plain message (which is the part i use). if i shove an image as an attachment in the message, it's put at attachment 1, base64 encoded, and my text portion is attached as attachment 2.

i had it working with the exact mail format from my carrier, but when i ran a message from someone else's phone through it, it failed in a whole bunch of miserable ways.

i have other projects i'd like to extend this phone->mail->parse->command system to, but i need to have a stable/solid/generic parser to get the different parts out of the mail to use it.

my end goal would be to have a function that i could feed the raw piped mail into, and get back a big array with associative sub-arrays of headers var:val pairs, and one for the body text as a whole string

the more i search on this, the more i find the same thing: giant overdeveloped mail handling packages that do everything under the sun that's related to mail, or useless (to me, in this project) tutorials.

i think i'm going to have to bite the bullet and just carefully write something myself.

Uberfuzzy
+1  A: 

You're probably not going to have much fun writing your own MIME parser. The reason you are finding "overdeveloped mail handling packages" is because MIME is a really complex set of rules/formats/encodings. MIME parts can be recursive, which is part of the fun. I think your best bet is to write the best MIME handler you can, parse a message, throw away everything that's not text/plain or text/html, and then force the command in the incoming string to be prefixed with COMMAND: or something similar so that you can find it in the muck. If you start with rules like that you have a decent chance of handling new providers, but you should be ready to tweak if a new provider comes along (or heck, if your current provider chooses to change their messaging architecture).
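
To make that approach concrete, here is a rough sketch of the recursive part: walk the MIME tree, keep only text/plain and text/html leaves, and then look for a prefixed command. The function names `collect_text_parts` and `find_command` and the `COMMAND:` prefix are illustrative assumptions. The sketch handles only base64 and quoted-printable transfer encodings, ignores charsets, and matches boundaries line by line, so treat it as a starting point rather than a full MIME implementation.

```php
<?php
// Sketch: recursively collect decoded text/plain and text/html parts.
// collect_text_parts() and find_command() are hypothetical names.
function collect_text_parts(string $raw): array
{
    $raw = str_replace(["\r\n", "\r"], "\n", $raw);
    // Prepending "\n" lets a part with no headers at all split correctly.
    [$head, $body] = array_pad(explode("\n\n", "\n" . $raw, 2), 2, '');
    $head = preg_replace('/\n[ \t]+/', ' ', $head); // unfold continuations

    $type = 'text/plain'; // MIME default when Content-Type is absent
    $boundary = null;
    $encoding = '7bit';
    if (preg_match('/^Content-Type:\s*([^;\s]+)/mi', $head, $m)) {
        $type = strtolower($m[1]);
    }
    if (preg_match('/^Content-Type:.*?boundary=(?:"([^"]+)"|([^;\s]+))/mi', $head, $m)) {
        $boundary = $m[1] !== '' ? $m[1] : $m[2];
    }
    if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $head, $m)) {
        $encoding = strtolower($m[1]);
    }

    if (strpos($type, 'multipart/') === 0 && $boundary !== null) {
        $parts = [];
        $current = null; // null while in the preamble or epilogue
        foreach (explode("\n", $body) as $line) {
            $trimmed = rtrim($line);
            if ($trimmed === '--' . $boundary || $trimmed === '--' . $boundary . '--') {
                if ($current !== null) {
                    $parts[] = implode("\n", $current);
                }
                // The closing delimiter ends the multipart body.
                if ($trimmed !== '--' . $boundary) {
                    $current = null;
                    break;
                }
                $current = [];
            } elseif ($current !== null) {
                $current[] = $line;
            }
        }
        if ($current !== null) {
            $parts[] = implode("\n", $current); // truncated: no closing delimiter
        }
        $found = [];
        foreach ($parts as $part) {
            // Each part is itself a small message, so recurse into it.
            $found = array_merge($found, collect_text_parts($part));
        }
        return $found;
    }

    if ($type !== 'text/plain' && $type !== 'text/html') {
        return []; // throw away everything that is not text
    }
    if ($encoding === 'base64') {
        $body = base64_decode($body);
    } elseif ($encoding === 'quoted-printable') {
        $body = quoted_printable_decode($body);
    }
    return [['type' => $type, 'body' => $body]];
}

// Return the text after the first "COMMAND:" prefix found, or null.
function find_command(array $parts, string $prefix = 'COMMAND:'): ?string
{
    foreach ($parts as $part) {
        $text = $part['type'] === 'text/html' ? strip_tags($part['body']) : $part['body'];
        if (preg_match('/^\s*' . preg_quote($prefix, '/') . '[ \t]*(.+)$/mi', $text, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}
```

Because every part is parsed with the same function as the whole message, nested multiparts need no special casing, and a new carrier that shuffles the part order should still work as long as the command keeps its prefix.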

jj33
+1  A: 

I'm not sure if this will be of help to you - hope so - but it will surely help others interested in finding out more about email. Marcus Bointon did one of the best presentations entitled "Mail() and life after Mail()" at the PHP London conference in March this year and the slides and MP3 are online. He speaks with some authority, having worked extensively with email and PHP at a deep level.

My perception is that you are in for a world of pain trying to write a truly generic parser.

Flubba
A: 

@jj33 yeah it's the recursion part of mime that keeps getting me, the rest i have a decent grasp on after rewriting this a couple times.

@Flubba that actually was very insightful, i flipped through the slides. i'll have to listen to the mp3 later.

Uberfuzzy
A: 

@Flubba can you re-up the slides / mp3? it gives page not found at the moment.

bibstha