I am building a system that automatically parses incoming emails and populates a database from them.
Initially there will only be 10-20 expected formats coming in, but long term there could be thousands of different formats.
The way I see it, the steps are:
- identify the format of the email (e.g. with a regex on the subject line)
- parse the email with the correct processor for that format
- check that the data is realistic, and maybe flag some emails for a manual check
- populate the database
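To make the steps above concrete, here is a rough sketch in Python using only the standard library. Everything named here (`Format`, `parse_order`, `is_realistic`, `process`, the "New order" subject and the body layout) is hypothetical and only for illustration, and plain lists stand in for the real database and the manual review queue.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from email.message import Message
from typing import Callable, List, Optional


@dataclass
class Format:
    """One known email format: how to recognise it and how to parse it."""
    name: str
    subject_pattern: re.Pattern
    parse: Callable[[Message], dict]


def parse_order(msg: Message) -> dict:
    # Hypothetical body layout: "Order <id> total <amount>"
    match = re.search(r"Order\s+(\d+)\s+total\s+([\d.]+)", msg.get_payload())
    if match is None:
        return {}
    return {"order_id": int(match.group(1)), "total": float(match.group(2))}


# Assumed registry of formats; in practice this could be loaded from storage.
FORMATS: List[Format] = [
    Format("order", re.compile(r"^New order", re.IGNORECASE), parse_order),
]


def identify(msg: Message) -> Optional[Format]:
    """Step 1: pick the first format whose regex matches the subject line."""
    subject = str(msg.get("Subject", ""))
    for fmt in FORMATS:
        if fmt.subject_pattern.search(subject):
            return fmt
    return None


def is_realistic(record: dict) -> bool:
    """Step 3: a made-up sanity check on the parsed values."""
    return bool(record) and 0 < record.get("total", 0) < 100_000


def process(raw: str, database: list, review_queue: list) -> str:
    msg = message_from_string(raw)
    fmt = identify(msg)
    if fmt is None:
        review_queue.append(("unknown format", raw))
        return "unrecognised"
    record = fmt.parse(msg)              # step 2: parse with the matching processor
    if not is_realistic(record):
        review_queue.append((fmt.name, raw))
        return "flagged"
    database.append(record)              # step 4: stand-in for a real insert
    return "stored"
```

Keeping each step as its own function means the detection, parsing and validation rules can each be swapped out per format without touching the others.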
What I am after is suggestions on how to structure this. For example, should I store the format definitions in the database or in a flat file? The system needs to be flexible: subject line detection might not be enough, and I might also have to scan the other email headers.
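One possible way to keep this flexible is to describe each format as data rather than code, so the same rules could live in a flat file or in database rows. The sketch below assumes a made-up JSON layout in which each rule maps header names to regexes, so matching works on any header and not just the subject. The rule names, the processor identifiers and the `X-Report-Type` header are all hypothetical.

```python
import json
import re
from email import message_from_string

# Assumed rule layout; the same structure could be stored as database rows.
FORMAT_RULES_JSON = r"""
[
  {"name": "supplier_a_invoice",
   "processor": "invoice_v1",
   "match": {"Subject": "^Invoice \\d+", "From": "@supplier-a\\.example"}},
  {"name": "stock_report",
   "processor": "stock_v1",
   "match": {"X-Report-Type": "^daily-stock$"}}
]
"""


def load_rules(text):
    """Parse the rules and pre-compile one regex per header."""
    rules = json.loads(text)
    for rule in rules:
        rule["compiled"] = {
            header: re.compile(pattern, re.IGNORECASE)
            for header, pattern in rule["match"].items()
        }
    return rules


def find_rule(msg, rules):
    """Return the first rule whose header patterns all match, else None."""
    for rule in rules:
        if all(
            pattern.search(str(msg.get(header, "")))
            for header, pattern in rule["compiled"].items()
        ):
            return rule
    return None
```

With this shape, adding a new format means adding a rule and a processor, and the matching code itself does not change as the number of formats grows.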
The data itself could be in the email body or in attachments such as PDF or Excel files.
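As a sketch of how the body and the attachments might be pulled apart before any format-specific parsing, the following uses Python's standard `email` package. The function name `extract_parts` is hypothetical, and actually reading a PDF or Excel payload would need a separate parser that is not shown here.

```python
from email import message_from_bytes, policy


def extract_parts(raw: bytes):
    """Split a raw email into its text body and a list of attachments."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Each attachment becomes (filename, MIME type, decoded content).
    attachments = [
        (part.get_filename(), part.get_content_type(), part.get_content())
        for part in msg.iter_attachments()
    ]
    return body, attachments
```

The MIME type of each attachment (for example `application/pdf`) could then be used to pick which processor handles it.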
A prime example of this sort of thing is the Picasa photo gallery, where you can email your photos to a specific email address and it automatically extracts them and puts them in a gallery for you.