I challenge you :)
I have a process that someone already implemented. I'll try to describe the requirements, and I'm hoping to get some input on the best way to do this.
It's for a financial institution.
I have a routing framework that allows me to receive files and send requests to other systems. I have a database I can use as I wish, but only my software has access to it.
The facts
- Via the routing framework I receive a file.
- Each line in this file follows a fixed-length format with the identification of a person and an amount (plus lots of other stuff).
- 99% of the time this file is below 100MB (around 800 bytes per line, i.e. 2.2MB ≈ 2,750 lines).
- Once a year we get 1-3GB of data instead.
- Running on an "appserver"
- I can fork subprocesses as I like. (within reason)
- I cannot ensure consistency when running for more than two days: subprocesses may die, the connection to the db/framework might be lost, files might move.
- I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
- It's possible/likely that sending LOTS of these getPerson requests will crash my "process".
- We're using java.
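Since each line is fixed-length, the parsing step itself is mechanical. A minimal sketch in Java, where the field offsets are invented for illustration (the real layout is ~800 bytes with many more fields):

```java
// Parses one fixed-length line. The offsets below are hypothetical --
// substitute the real record layout.
public final class FixedLengthRecord {
    // Assumed field positions (start inclusive, end exclusive).
    private static final int ID_START = 0, ID_END = 10;
    private static final int AMOUNT_START = 10, AMOUNT_END = 22;

    public final String personId;
    public final long amountMinorUnits; // e.g. cents, to avoid float rounding
    public final String rawLine;        // keep the original for writing back out

    public FixedLengthRecord(String line) {
        this.rawLine = line;
        this.personId = line.substring(ID_START, ID_END).trim();
        this.amountMinorUnits = Long.parseLong(line.substring(AMOUNT_START, AMOUNT_END).trim());
    }
}
```

Keeping the raw line around means the output file can be produced by echoing the original bytes plus the enrichment, rather than re-serializing every field.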
Requirements
- I must return a file with all the data, plus I must add some more info for some lines (about 25-50% of the lines: at least 25,000).
- This info I can only get by doing a getPerson request via the framework to another system, one per person. Each call takes between 200 and 400 msec.
- It must be able to complete within two days
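As a sanity check on the two-day budget: at 200-400 msec per synchronous getPerson call, even a pessimistic volume fits easily. A rough calculation (the 50,000-call figure is illustrative, sized from the "at least 25,000" requirement):

```java
public final class ThroughputEstimate {
    /** Seconds needed for `calls` synchronous requests at `msPerCall` each,
     *  spread evenly over `workers` parallel subprocesses. */
    public static long estimateSeconds(long calls, long msPerCall, int workers) {
        return (calls * msPerCall) / (1000L * workers);
    }

    public static void main(String[] args) {
        // 50,000 calls at a pessimistic 400 ms, single-threaded:
        System.out.println(estimateSeconds(50_000, 400, 1)); // 20000 s, ~5.5 h
        // The same load over 4 workers:
        System.out.println(estimateSeconds(50_000, 400, 4)); // 5000 s, ~1.4 h
    }
}
```

So raw throughput is not the threat to the two-day limit; crashes and restarts are, which is why checkpointing matters more than speed here.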
Nice to have
- Checkpointing. If I'm going to run for a long time, I'd sure like to be able to restart the process without starting from the top. ...
How would you design this? I will add the current "hack" and my brief idea later.
========== Current solution ================
It's running on BEA/Oracle WebLogic Integration, not by choice but by mandate.
When the file is received, each line is read into a database table with the columns id, line, status, batchfilename, with status set to 'Needs processing'.
When all lines are in the database, the rows are partitioned by id mod 4 and one process is started per quarter of the rows; each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38,000 lines in the current batch).
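The mod-4 split can be expressed as a pure function plus a SQL predicate, which keeps the partitioning testable and makes the worker count a parameter rather than a hard-coded 4. A sketch (table and column names taken from the description above; the MOD syntax shown is the Oracle form):

```java
public final class Partitioner {
    /** Which of `workers` processes owns the row with this id. */
    public static int partitionOf(long rowId, int workers) {
        return (int) (rowId % workers);
    }

    /** SQL predicate a worker uses to claim only its own unprocessed rows.
     *  Selecting by status (not by a precomputed id range) means a restarted
     *  worker automatically skips rows already marked 'Processed'. */
    public static String claimPredicate(int worker, int workers) {
        return "status = 'Needs processing' AND MOD(id, " + workers + ") = " + worker;
    }
}
```

Because the predicate filters on status, the status column doubles as the checkpoint: killing and restarting a worker costs only the rows in flight, not its whole quarter.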
When all 4 quarters of the rows have been Processed, a writer process starts, selecting 100 rows at a time, writing them to the file, and updating their status to 'Written'. When everything is done, the new file is handed back to the routing framework and an "I'm done" email is sent to the operations crew.
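The writer's select-100-write-update loop is itself restartable as long as each batch of 100 is written and marked 'Written' in one transaction. The batching logic is trivially separable from the JDBC and file I/O; a sketch:

```java
import java.util.ArrayList;
import java.util.List;

public final class BatchWriter {
    /** Split the pending rows into write batches of at most `batchSize`.
     *  In the real writer, each batch would be handled inside one DB
     *  transaction: append the lines to the output file, then UPDATE
     *  their status to 'Written', then commit. */
    public static <T> List<List<T>> batches(List<T> rows, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            out.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return out;
    }
}
```

One caveat with this scheme: if the process dies after the file append but before the commit, a restart can duplicate the last batch in the output file, so the restart path should truncate or verify the file tail against the 'Written' rows.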
The 4 processing processes can/will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.