I am building a system that automatically parses incoming emails and populates a database from them.

Initially there will only be 10-20 expected formats coming in, but long term there could be thousands of different formats.

The way I see it:

  1. Identify the format of the email (e.g. a regex on the subject line)
  2. Parse the email with the correct processor
  3. Check that the data is realistic, and maybe flag some for a manual check
  4. Populate the database
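A minimal sketch of how these four steps could fit together, using Python's standard `email` module. The format name `daily_report`, its subject pattern, the `total:` body line, and the validation range are all made up for illustration, and plain lists stand in for the database and the manual-review queue.

```python
import re
from email import message_from_string

# Hypothetical format table: format name -> subject-line pattern.
SUBJECT_PATTERNS = {
    "daily_report": re.compile(r"^Daily report (?P<date>\d{4}-\d{2}-\d{2})$"),
}


def identify_format(msg):
    """Step 1: return (format_name, match) for the first pattern that fits."""
    subject = msg.get("Subject", "")
    for name, pattern in SUBJECT_PATTERNS.items():
        match = pattern.match(subject)
        if match:
            return name, match
    return None, None


def parse_daily_report(msg, match):
    """Step 2: hypothetical processor reading 'total: <number>' from the body."""
    found = re.search(r"^total:\s*(-?\d+)\s*$", msg.get_payload(), re.MULTILINE)
    total = int(found.group(1)) if found else None
    return {"date": match.group("date"), "total": total}


# Maps each format name to the function that knows how to parse it.
PROCESSORS = {"daily_report": parse_daily_report}


def needs_manual_check(record):
    """Step 3: flag records whose values look unrealistic (made-up rule)."""
    return record["total"] is None or not (0 <= record["total"] <= 1_000_000)


def process(raw, database, review_queue):
    """Run steps 1-4 on one raw message."""
    msg = message_from_string(raw)
    name, match = identify_format(msg)
    if name is None:
        review_queue.append({"reason": "unknown format", "subject": msg.get("Subject", "")})
        return
    record = PROCESSORS[name](msg, match)
    if needs_manual_check(record):
        review_queue.append({"reason": "failed validation", "record": record})
    else:
        database.append(record)  # Step 4: a real system would INSERT here.
```

Unrecognised and implausible messages both land in the review queue rather than being dropped, so nothing is lost while new formats are still being added.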

What I am after is suggestions on how to structure this, e.g. whether to store the formats in the database or in flat files. The system needs to be flexible: subject-line detection might not be enough, and I might also have to scan the email headers.

The data itself could be in the email body or in attachments such as PDF or Excel files.

A prime example of this sort of thing is the Picasa photo gallery, where you can email your photos to a specific address and it automatically extracts them and puts them in a gallery for you.

A: 

What you probably want to do is first parse the headers and subject line, then look up the matching format definition in the database. Because there are potentially thousands of formats, a database is going to be the easiest approach, since it can be changed dynamically. There is no use creating thousands of files.
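As a rough illustration of that idea, here is a sketch that keeps the detection rules in SQLite and matches them against any header, not just the subject. The table name `format_rules`, its columns, and the processor names are assumptions made up for this example, not an established schema.

```python
import re
import sqlite3
from email import message_from_string


def create_rule_table(conn):
    """Create the (hypothetical) table holding one detection rule per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS format_rules ("
        " id INTEGER PRIMARY KEY,"
        " header TEXT NOT NULL,"      # header to inspect, e.g. 'Subject'
        " pattern TEXT NOT NULL,"     # regular expression applied to its value
        " processor TEXT NOT NULL,"   # name of the parser to run on a match
        " priority INTEGER NOT NULL DEFAULT 0)"
    )


def add_rule(conn, header, pattern, processor, priority=0):
    """Register a new format without touching the code."""
    conn.execute(
        "INSERT INTO format_rules (header, pattern, processor, priority)"
        " VALUES (?, ?, ?, ?)",
        (header, pattern, processor, priority),
    )


def find_processor(conn, msg):
    """Return the processor name of the highest-priority matching rule, or None."""
    rows = conn.execute(
        "SELECT header, pattern, processor FROM format_rules"
        " ORDER BY priority DESC, id"
    )
    for header, pattern, processor in rows:
        # A header can occur several times, so check every occurrence.
        for value in msg.get_all(header, []):
            if re.search(pattern, str(value)):
                return processor
    return None


conn = sqlite3.connect(":memory:")
create_rule_table(conn)
add_rule(conn, "Subject", r"^Invoice \d+", "invoice_v1")
add_rule(conn, "X-Mailer", r"AcmeReports", "acme_report", priority=10)

msg = message_from_string("Subject: Invoice 1234\nX-Mailer: AcmeReports 2.0\n\nbody\n")
print(find_processor(conn, msg))
```

With thousands of rules, scanning every row per message may become slow; grouping rules by sender or recipient address first would be one way to narrow the candidates.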

Chacha102
+2  A: 

Probably not the most popular answer, but have you looked at standard ways of doing this, like procmail? It gives you a basic understanding of emails and lets you build filters around everything (processing mails through a file-type detector first, applying regexps to all possible headers, ...).

That way you keep every part of your system in a specialised script/program and produce a modular solution that can easily be extended. Plus you may use any tool that has already been programmed by somebody else.

For the file-type filter: I do something comparable via procmail for broken/old PGP mails, adding the missing content type.

# repair pgp-encoded messages with missing Content-Type
######################################################################

:0
* !^Content-Type: message/
* !^Content-Type: multipart/
* !^Content-Type: application/pgp
{
   :0 fBw
   * ^-----BEGIN PGP MESSAGE-----
   * ^-----END PGP MESSAGE-----
   | /usr/bin/formail \
       -i "Content-Type: application/pgp; format=text; x-action=encrypt"

   :0 fBw
   * ^-----BEGIN PGP SIGNED MESSAGE-----
   * ^-----BEGIN PGP SIGNATURE-----
   * ^-----END PGP SIGNATURE-----
   | /usr/bin/formail \
       -i "Content-Type: application/pgp; format=text; x-action=sign"
}

Further processing could then match content types and assign special handlers to special types (and generic handlers to unknown types).
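Outside procmail, the same handler-per-content-type idea might look like the following Python sketch. The handler functions and what they return are placeholders chosen for illustration; a real PDF or Excel handler would call a proper parser.

```python
from email.message import EmailMessage


def handle_text(part):
    """Special handler: decode a plain-text part to a string."""
    charset = part.get_content_charset() or "utf-8"
    text = part.get_payload(decode=True).decode(charset, errors="replace")
    return ("text", text.strip())


def handle_pdf(part):
    """Special handler placeholder: only reports the attachment size in bytes."""
    return ("pdf", len(part.get_payload(decode=True)))


def handle_unknown(part):
    """Generic handler for any content type without a special handler."""
    return ("unknown", part.get_content_type())


# Content type -> handler; anything not listed falls back to handle_unknown.
HANDLERS = {
    "text/plain": handle_text,
    "application/pdf": handle_pdf,
}


def dispatch(msg):
    """Walk all MIME parts and run the handler registered for each leaf part."""
    results = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no data themselves
        handler = HANDLERS.get(part.get_content_type(), handle_unknown)
        results.append(handler(part))
    return results


msg = EmailMessage()
msg["Subject"] = "test"
msg.set_content("hello\n")
msg.add_attachment(b"%PDF-1.4 fake", maintype="application",
                   subtype="pdf", filename="a.pdf")
msg.add_attachment(b"\x00\x01", maintype="application",
                   subtype="octet-stream", filename="x.bin")
print(dispatch(msg))
```

Adding support for a new attachment type then means writing one function and adding one entry to `HANDLERS`.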

Don Johe
A: 

Use the PHPMailer library for it.

dusoft
Isn't that just for sending mail?
bumperbox