I have a daily process that relies on flat files delivered to a "drop box" directory on the file system. Their arrival kicks off a load of this comma-delimited data (exported from an external company's Excel files, etc.) into a database by a piecemeal Perl/Bash application. That database is used by multiple applications and is also edited directly with some GUI tools. Some of the data then gets replicated by another Perl app into the database that I mainly use.

Needless to say, all of that is complicated and error prone: incoming data is sometimes corrupt, and sometimes a direct edit breaks it. My users often complain about missing or incorrect data. Diffing the flat files and databases to analyze where the process breaks is time consuming, and with each passing day the data becomes more out of date and harder to analyze.

I plan to fix or rewrite parts or all of this data transfer process.

I am looking for recommended reading before I embark on this: websites and articles on how to write robust, failure-resistant, and auto-recoverable ETL processes. Other advice would also be appreciated.

+1  A: 

This is precisely what Message Queue Managers are designed for. Some examples are here.
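A minimal sketch of what that could look like for this drop-box setup, assuming a RabbitMQ broker and the pika client library (the queue name, host, and message fields are placeholders, not anything from the original answer): a watcher publishes a "file arrived" message, and the loader consumes it, acknowledging only after a successful load so failed loads get redelivered.

    import json
    import pika

    QUEUE = "incoming_files"  # placeholder queue name

    def publish_file_arrival(path, row_count):
        """Announce that a new drop-box file is ready for the loader."""
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE, durable=True)  # queue survives broker restarts
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps({"path": path, "rows": row_count}),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
        connection.close()

    def consume_file_arrivals(handle_file):
        """Run the loader for each queued file; ack only after a successful load."""
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE, durable=True)

        def on_message(ch, method, properties, body):
            msg = json.loads(body)
            handle_file(msg["path"])  # load/validate the file
            ch.basic_ack(delivery_tag=method.delivery_tag)  # unacked messages are redelivered

        channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
        channel.start_consuming()

The point of the queue is decoupling and recoverability: if the loader crashes mid-file, the unacknowledged message is simply delivered again instead of the file silently going missing.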

Seun Osewa
Excellent point, that certainly points me in the right direction regarding the rewrite of this mess. +1
Ville M
+1  A: 

You don't say what database backend you have, but in SQL Server I would write this as an SSIS package. We have a system designed to also write data to a metadata database that tells us when the file was picked up, whether it processed successfully, and why it failed if it did not. It also records things like how many rows the file had (which we can then use to determine whether the current row count is abnormal). One of the beauties of SSIS is that I can set up configurations on package connections and variables, so that moving the package from development to production is easy (once I have a configuration set up in the config table, I don't have to go in and manually change the connections each time).
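The load-audit idea above is tool-agnostic; here is a rough sketch of it in Python, recording when each file was picked up, its row count, and whether it loaded, and flagging abnormal file sizes against history. The table and column names are illustrative, and sqlite3 stands in for whatever database you actually use.

    import sqlite3
    from datetime import datetime, timezone

    AUDIT_DDL = """
    CREATE TABLE IF NOT EXISTS file_load_audit (
        file_name      TEXT,
        picked_up_at   TEXT,
        row_count      INTEGER,
        status         TEXT,      -- 'loaded' or 'rejected'
        failure_reason TEXT
    )
    """

    def record_load(conn, file_name, row_count, status, failure_reason=None):
        """Write one audit row per file processed."""
        conn.execute(AUDIT_DDL)
        conn.execute(
            "INSERT INTO file_load_audit VALUES (?, ?, ?, ?, ?)",
            (file_name, datetime.now(timezone.utc).isoformat(),
             row_count, status, failure_reason),
        )
        conn.commit()

    def row_count_looks_abnormal(conn, file_name, row_count, tolerance=0.5):
        """Flag files whose size deviates sharply from the historical average."""
        cur = conn.execute(
            "SELECT AVG(row_count) FROM file_load_audit "
            "WHERE file_name = ? AND status = 'loaded'",
            (file_name,),
        )
        avg = cur.fetchone()[0]
        if avg is None:
            return False  # no history for this file yet
        return abs(row_count - avg) > tolerance * avg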

In SSIS we do various checks to ensure the data is correct, or clean up the data before inserting it into our database. Actually we do lots and lots of checks. Questionable records can be removed from the file processing and put in a separate location for the DBAs to examine and possibly pass back to the customer. We can also check whether the data in various columns (and the column names, if given; not all files have them) is what would be expected. So if the zipcode field suddenly has 250 characters, we know something is wrong and can reject the file before processing. That way, when the client swaps the lastname column with the firstname column without telling us, we can reject the file before importing 100,000 new incorrect records. In SSIS we can also use fuzzy logic to match against existing records, so if an incoming record says John Smith's address is 213 State St., it can match our record that says he lives at 215 State Street.
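If SSIS isn't an option, the same pre-load checks can be done in a small script before anything touches the database. This is only a sketch of the idea described above; the column names and length limits are made up, and the fuzzy matching is left out.

    import csv

    # Hypothetical file layout and sanity limits.
    EXPECTED_COLUMNS = ["first_name", "last_name", "address", "zipcode"]
    MAX_FIELD_LENGTHS = {"first_name": 50, "last_name": 50, "address": 200, "zipcode": 10}

    def validate_file(path, quarantine_path):
        """Return rows safe to load; divert questionable rows to a quarantine file."""
        good_rows, bad_rows = [], []
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh)
            if reader.fieldnames != EXPECTED_COLUMNS:
                # Columns renamed or reordered (e.g. firstname/lastname swapped):
                # reject the whole file before importing anything.
                raise ValueError(f"unexpected header in {path}: {reader.fieldnames}")
            for row in reader:
                problems = [
                    col for col, limit in MAX_FIELD_LENGTHS.items()
                    if len(row.get(col, "")) > limit
                ]
                (bad_rows if problems else good_rows).append(row)
        if bad_rows:
            with open(quarantine_path, "w", newline="") as fh:
                writer = csv.DictWriter(fh, fieldnames=EXPECTED_COLUMNS)
                writer.writeheader()
                writer.writerows(bad_rows)  # set aside for the DBAs to examine
        return good_rows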

It takes a lot to set up a process this way, but once you do, the extra confidence that you are processing good data is worth its weight in gold.

Even if you can't use SSIS, this should at least give you some ideas of the types of things you should be doing to get the information into your database.

HLGEM
It's an Oracle database
Ville M
Then I'd suggest you look up Oracle ETL tools to see what is available to do something similar. For me the key is really deciding what you need to check before the final step of putting the data into the production tables; the more checks, the better off you are. Having the metadata to see what happened with each import is also critical. Having a place to store potential problem records until someone can clear them may also be a good idea, depending on your environment. That way you can get the 200,000 good records in and not have the whole import fail because one record has an extra tab.
HLGEM
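A hedged sketch of that "stage first, check, then promote" pattern against Oracle, using the cx_Oracle driver: bulk-load into a staging table, park suspect rows in a holding table, and only then merge the clean rows into production. The table names, connection details, and the single length check shown are all placeholders for whatever rules fit your data.

    import cx_Oracle

    def load_via_staging(rows):
        """rows: list of (first_name, last_name, address, zipcode) tuples."""
        conn = cx_Oracle.connect(user="etl", password="secret", dsn="dbhost/ORCL")
        cur = conn.cursor()
        cur.execute("TRUNCATE TABLE customer_staging")
        cur.executemany(
            "INSERT INTO customer_staging (first_name, last_name, address, zipcode) "
            "VALUES (:1, :2, :3, :4)",
            rows,
        )
        # Park questionable records instead of failing the whole batch.
        cur.execute(
            "INSERT INTO customer_quarantine "
            "SELECT * FROM customer_staging WHERE LENGTH(zipcode) > 10"
        )
        cur.execute("DELETE FROM customer_staging WHERE LENGTH(zipcode) > 10")
        # Promote the remaining clean rows into the production table.
        cur.execute(
            "MERGE INTO customers c "
            "USING customer_staging s "
            "ON (c.first_name = s.first_name AND c.last_name = s.last_name) "
            "WHEN MATCHED THEN UPDATE SET c.address = s.address, c.zipcode = s.zipcode "
            "WHEN NOT MATCHED THEN INSERT (first_name, last_name, address, zipcode) "
            "VALUES (s.first_name, s.last_name, s.address, s.zipcode)"
        )
        conn.commit()
        conn.close()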