I challenge you :)

I have a process that someone already implemented. I will try to describe the requirements, and I was hoping to get some input on the "best way" to do this.


It's for a financial institution.

I have a routing framework that allows me to receive files and send requests to other systems. I have a database I can use as I wish, but only me and my software have access to it.

The facts

  • Via the routing framework I receive a file.
  • Each line in this file follows a fixed-length format with the identification of a person and an amount (+ lots of other stuff); a small parsing sketch follows this list.
  • 99% of the time the file is below 100 MB (around 800 bytes per line, i.e. 2.2 MB ≈ 2,600 lines).
  • Once a year we have 1-3 GB of data instead.
  • Running on an "appserver".
  • I can fork subprocesses as I like (within reason).
  • I cannot ensure consistency when running for more than two days: subprocesses may die, the connection to the DB/framework might be lost, files might move.
  • I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
    • It's possible/likely that sending these getPerson requests will crash my "process" when sending LOTS.
  • We're using Java.
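Since the lines are fixed length, the parsing itself is the easy part. For illustration only, a minimal sketch; the offsets and widths below are invented, as the real ~800-byte record layout isn't given here:

    import java.math.BigDecimal;

    // Hypothetical fixed-length layout: the offsets and widths below are invented
    // for illustration only; the real ~800-byte record layout is not described.
    public final class Line {
        final String personId;
        final BigDecimal amount;
        final String rest;

        private Line(String personId, BigDecimal amount, String rest) {
            this.personId = personId;
            this.amount = amount;
            this.rest = rest;
        }

        static Line parse(String raw) {
            String personId = raw.substring(0, 20).trim();                    // columns 0-19 (assumed)
            BigDecimal amount = new BigDecimal(raw.substring(20, 32).trim()); // columns 20-31 (assumed)
            return new Line(personId, amount, raw.substring(32));             // everything else kept untouched
        }
    }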


Requirements

  • I must return a file with all the data, plus some extra info added to some of the lines (about 25-50% of the lines: at least 25,000).
  • This info I can only get by doing a getPerson request via the framework to another system. One request per person, taking between 200 and 400 ms.
  • It must be able to complete within two days (a rough feasibility estimate follows this list).
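As a rough feasibility check: for the common case, 25,000 getPerson calls at roughly 300 ms each is about 7,500 seconds, i.e. a little over two hours even when done strictly sequentially. The yearly 1-3 GB file is the real concern: at ~800 bytes per line that is roughly 1.25-3.75 million lines, and if 25-50% of those need a call, a single sequential worker would need anywhere from about one day to well over six days. So the worst case needs parallel getPerson calls (or checkpointed restarts, or both) to stay inside the two-day window.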

Nice to have

  • Checkpointing. If I'm going to run for a long time, I sure would like to be able to restart the process without starting from the top. ...

How would you design this? I will later add the current "hack" and my brief idea

========== Current solution ================

It's running on BEA/Oracle WebLogic Integration, not by choice but by definition

When the file is received, each line is read into a database with

id, line, status, batchfilename

and status 'Needs processing'.
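A rough sketch of that load step in plain JDBC (the table and column names come from the description above; the id column, the batching and the transaction handling are assumptions):

    import java.io.BufferedReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Load every line of the received file into the staging table with
    // status 'Needs processing'. Assumes the id column is filled by the
    // database (sequence/identity) and that auto-commit is switched off.
    public class BatchLoader {

        private static final int BATCH_SIZE = 500;

        public void load(Connection con, Reader file, String batchFileName) throws Exception {
            con.setAutoCommit(false);
            String sql = "INSERT INTO batch_line (line, status, batchfilename) "
                       + "VALUES (?, 'Needs processing', ?)";
            try (BufferedReader in = new BufferedReader(file);
                 PreparedStatement ps = con.prepareStatement(sql)) {
                String line;
                int pending = 0;
                while ((line = in.readLine()) != null) {
                    ps.setString(1, line);
                    ps.setString(2, batchFileName);
                    ps.addBatch();
                    if (++pending == BATCH_SIZE) {   // flush in chunks to keep memory bounded
                        ps.executeBatch();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ps.executeBatch();
                }
            }
            con.commit();
        }
    }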

When all lines are in the database, the rows are separated by mod 4 and a process is started for each quarter of the rows; each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38.0000 in the current batch).
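One of those quarter-processors might look roughly like this (same table names; the "does this line need enrichment" test and the real getPerson call are stubbed out, and the per-row commit assumes the driver keeps the cursor open across commits):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // One of the four enrichment workers: handles the rows whose id falls in
    // its quarter (MOD(id, 4) = partition) and that are still 'Needs processing'.
    public class EnrichmentWorker {

        private final int partition; // 0..3

        public EnrichmentWorker(int partition) {
            this.partition = partition;
        }

        public void run(Connection con) throws Exception {
            con.setAutoCommit(false);
            String select = "SELECT id, line FROM batch_line "
                          + "WHERE MOD(id, 4) = ? AND status = 'Needs processing'";
            String update = "UPDATE batch_line SET line = ?, status = 'Processed' WHERE id = ?";
            try (PreparedStatement sel = con.prepareStatement(select);
                 PreparedStatement upd = con.prepareStatement(update)) {
                sel.setInt(1, partition);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        String enriched = enrich(rs.getString("line")); // getPerson where needed
                        upd.setString(1, enriched);
                        upd.setLong(2, id);
                        upd.executeUpdate();
                        con.commit(); // per-row commit: a crash costs at most one getPerson call
                    }
                }
            }
        }

        private String enrich(String line) { /* parse line, call getPerson if required */ return line; }
    }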

When all 4 quarters of the rows have been Processed, a writer process starts, selecting 100 rows at a time from the database, writing them to the file and updating their status to 'Written'. When all is done the new file is handed back to the routing framework, and an "I'm done" email is sent to the operations crew.
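And a sketch of that writer step, under the same assumptions (Oracle-style ROWNUM used for the 100-row pages):

    import java.io.Writer;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Drain 'Processed' rows 100 at a time, append them to the output file
    // and mark them 'Written', until nothing is left for this batch file.
    public class BatchWriter {

        public void writeAll(Connection con, Writer out, String batchFileName) throws Exception {
            con.setAutoCommit(false);
            String select = "SELECT id, line FROM batch_line "
                          + "WHERE status = 'Processed' AND batchfilename = ? AND ROWNUM <= 100";
            String update = "UPDATE batch_line SET status = 'Written' WHERE id = ?";
            try (PreparedStatement sel = con.prepareStatement(select);
                 PreparedStatement upd = con.prepareStatement(update)) {
                sel.setString(1, batchFileName);
                boolean more = true;
                while (more) {
                    more = false;
                    try (ResultSet rs = sel.executeQuery()) {
                        while (rs.next()) {
                            more = true;
                            out.write(rs.getString("line"));
                            out.write(System.lineSeparator());
                            upd.setLong(1, rs.getLong("id"));
                            upd.executeUpdate();
                        }
                    }
                    out.flush();
                    con.commit(); // one commit per 100-row chunk
                }
            }
        }
    }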

The 4 processing processes can/will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.

+1  A: 

I would design this for money.

Steven A. Lowe
+1  A: 

Nice try! But if you happen to have a question during your design, don't hesitate to ask.

OscarRyz
+1  A: 

When you receive the file, parse it and put the information in the database.

Make one table with a record per line that will need a getPerson request.

Have one or more threads get records from this table, perform the request and put the completed record back in the table.

Once all records are processed, generate the complete file and return it.
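A minimal sketch of that worker-thread idea, assuming a fixed thread pool; PendingRecord, getPerson and markProcessed are placeholders for the real table records, the framework call and the write-back, and a DataSource is used so the workers don't share one connection:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import javax.sql.DataSource;

    // Worker-pool sketch: each task handles one pending record, doing the
    // synchronous getPerson call and writing the enriched result back.
    public class Enricher {

        public void enrich(DataSource db, List<PendingRecord> pending, int threads) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (PendingRecord rec : pending) {
                pool.submit(() -> {
                    try {
                        String extra = getPerson(rec.personId);   // synchronous 200-400 ms call
                        markProcessed(db, rec.id, extra);         // write back + mark the row as done
                    } catch (Exception e) {
                        // leave the record untouched so a later (re)run picks it up again
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(2, TimeUnit.DAYS);              // hard ceiling from the requirements
        }

        public static class PendingRecord {
            final long id;
            final String personId;
            public PendingRecord(long id, String personId) { this.id = id; this.personId = personId; }
        }

        private String getPerson(String personId) { /* framework request */ return ""; }

        private void markProcessed(DataSource db, long id, String extra) { /* UPDATE the record */ }
    }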

Rasmus Faber
Pretty accurate description of the current solution :)
svrist
+1  A: 

If the processing of the file takes 2 days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one. If for some reason the whole processing is interrupted, you will not have to start all over again.

By splitting the larger file into smaller files, you could also use more servers to process the files.

You could also use a mass loader (Oracle's SQL*Loader, for example) to get the large amount of data from the file into the table, again adding a column to mark whether the line has been processed, so you can pick up where you left off if the process crashes.

The return value could be many small files which would be combined into a single large file at the end. If the database approach is chosen, you could also save the results in a table, which could then be extracted to a CSV file.
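A rough sketch of the split step, assuming a purely line-count-based split (the chunk size and character set are arbitrary choices here):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    // Split one large input file into numbered chunk files of at most maxLines
    // lines, so each chunk can be processed (and re-processed after a crash)
    // on its own. The charset is an assumption.
    public class FileSplitter {

        public List<Path> split(Path input, Path workDir, int maxLines) throws IOException {
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.ISO_8859_1)) {
                String line = in.readLine();
                int chunkNo = 0;
                while (line != null) {
                    Path chunk = workDir.resolve(input.getFileName() + "." + chunkNo++);
                    try (BufferedWriter out = Files.newBufferedWriter(chunk, StandardCharsets.ISO_8859_1)) {
                        for (int n = 0; n < maxLines && line != null; n++) {
                            out.write(line);
                            out.newLine();
                            line = in.readLine();
                        }
                    }
                    chunks.add(chunk);
                }
            }
            return chunks;
        }
    }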

nchris
Right on! As I will soon sketch, my thoughts went along a Map/Reduce approach, splitting the files up into "Work Units". As @le dorfier points out, it's the line that's the atomic unit, and the order of the lines is not significant.
svrist
+4  A: 

Simplify as much as possible.

The batches (trying to process them as units, and their various sizes) appear to be discardable in terms of the simplest process. It sounds like the rows are atomic, not the batches.

Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing) failures. Then you can deal with the problems strictly on an exception basis. (A queue table in your database can probably work.)

Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.
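A sketch of what such a queue table and a worker loop over it might look like; the table and column names are invented and Oracle-flavoured SQL is assumed, and with several concurrent workers you would additionally need row claiming (for example SELECT ... FOR UPDATE SKIP LOCKED):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Queue-table sketch: one row per line, batch identity is just a column,
    // and failures are marked rather than retried blindly, so problem cases
    // can be dealt with on an exception basis afterwards.
    //
    // Assumed table (names invented):
    //   line_queue(id, batch_id, payload, status, error_msg)
    //   status in ('PENDING', 'DONE', 'FAILED')
    public class QueueWorker {

        public void drain(Connection con) throws Exception {
            con.setAutoCommit(false);
            String next   = "SELECT id, payload FROM line_queue WHERE status = 'PENDING' AND ROWNUM = 1";
            String done   = "UPDATE line_queue SET status = 'DONE', payload = ? WHERE id = ?";
            String failed = "UPDATE line_queue SET status = 'FAILED', error_msg = ? WHERE id = ?";
            while (true) {
                long id;
                String payload;
                try (PreparedStatement sel = con.prepareStatement(next);
                     ResultSet rs = sel.executeQuery()) {
                    if (!rs.next()) {
                        return; // queue drained
                    }
                    id = rs.getLong("id");
                    payload = rs.getString("payload");
                }
                try (PreparedStatement upd = con.prepareStatement(done)) {
                    upd.setString(1, enrich(payload));   // getPerson call where the line needs it
                    upd.setLong(2, id);
                    upd.executeUpdate();
                } catch (Exception e) {
                    try (PreparedStatement upd = con.prepareStatement(failed)) {
                        upd.setString(1, String.valueOf(e));
                        upd.setLong(2, id);
                        upd.executeUpdate();
                    }
                }
                con.commit(); // one line = one transaction
            }
        }

        private String enrich(String payload) { /* parse + getPerson if required */ return payload; }
    }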

le dorfier
Nice insight! That's a pretty good deduction from my hand-waving description.
svrist