Hi,

Currently we have a batch-driven process at work which runs every 15 minutes, and each time it runs it repeats this cycle several times:

  1. Calls a sproc and gets some data back from the DB
  2. Processes the data
  3. Saves the results back to the DB

It can't load all the data in one go because the data are segregated by a number of fields, and each group of data requires different behaviour during processing (configurable from a front end). However, recent changes in the business have resulted in a sudden surge in the volume of data (and therefore the processing time required) for some of the groups, so now whenever one of the groups overruns it delays all the other groups.
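
To make the cycle concrete, here's a minimal sketch of the serial loop (Python; load_group, process_group and save_results are made-up stand-ins for the sproc call and the per-group logic):

    # Minimal sketch of the existing serial batch cycle; the three
    # helpers are hypothetical stand-ins for the sproc call and the
    # group-specific processing/saving described above.
    def run_batch(groups):
        for group in groups:
            data = load_group(group)             # 1. call sproc, fetch this group's data
            result = process_group(group, data)  # 2. apply the group's configured behaviour
            save_results(group, result)          # 3. write the results back to the DB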

Our plan is to parallelise this process across multiple machines (see the sketch after this list) so that:

  • there is a central controller (master) and several workstations (slaves)
  • master is responsible for scheduling the runs (configurable from a front end)
  • master (or a separate component) is responsible for loading/saving data to and from the DB (in order to avoid deadlocks/contention between the multiple slaves)
  • slaves receive work items, process them and return the results to master
  • there is a primary slave (main production server in our environment) which will usually receive all the work items
  • secondary slaves will receive work only if the primary slave is working on a group which requires longer processing time (master can identify this based on the size of the data returned or it can be left to configuration)
  • if a slave throws an exception during processing, an alert email is sent to the support team, and the same work item is picked up during the next scheduled cycle
  • not sure what to do with timeouts yet
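
In rough pseudo-code, the master's dispatch rule would look something like this (a Python-style sketch; the Slave interface and the size threshold are assumptions, not an existing implementation):

    # Rough sketch of the dispatch rule above. Slave and
    # LONG_RUNNING_THRESHOLD are hypothetical; the threshold could
    # equally come from per-group configuration.
    LONG_RUNNING_THRESHOLD = 100_000  # assumed cut-off (e.g. row count) for "long" groups

    def dispatch(work_items, primary, secondaries):
        for item in work_items:
            if primary.busy and primary.current_item_size() > LONG_RUNNING_THRESHOLD:
                # primary is tied up on a long-running group: overflow to a
                # free secondary, otherwise fall back to queueing on primary
                idle = next((s for s in secondaries if not s.busy), None)
                (idle or primary).submit(item)
            else:
                primary.submit(item)  # the usual case: primary gets all the work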

I have done some research on the Master-Slave pattern for distributed environments but so far haven't found much reference material. Does anyone here know of a good implementation of such a pattern? Any pointers on potential pitfalls of such an architecture would be much appreciated too!

Thanks,

A: 

Your Master/Slave design above seems to imply that the writes to the database will be serialised anyway, so have you considered simply running multiple copies of your current process in parallel (e.g. by forking a new process for each job) and managing contention via a shared application lock?

Chris Card
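
For what it's worth, on SQL Server a shared application lock usually means sp_getapplock; here's a rough sketch (Python with pyodbc; the resource name, timeout and save_fn are placeholders) of how each forked copy could serialise its write phase:

    import pyodbc

    # Sketch: serialise the write-back step across forked copies with a
    # SQL Server application lock. Resource name and timeout are placeholders.
    def save_with_app_lock(conn, save_fn):
        cur = conn.cursor()
        # sp_getapplock returns >= 0 on success, negative on timeout/failure
        cur.execute(
            "SET NOCOUNT ON;"
            " DECLARE @r int;"
            " EXEC @r = sp_getapplock @Resource = 'batch_writer',"
            " @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = 60000;"
            " SELECT @r;"
        )
        if cur.fetchone()[0] < 0:
            raise TimeoutError("could not acquire application lock")
        try:
            save_fn(conn)  # the serialised write back to the DB
        finally:
            cur.execute(
                "EXEC sp_releaseapplock @Resource = 'batch_writer', @LockOwner = 'Session';"
            )
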
Yup, that was one of the solutions we considered, but there are serious limitations on how far we can go with this approach, as the main production box is also used by other business-critical, processing-intensive apps in our world. The problem here is that our process maxes out one of the CPUs for pretty much the duration of the processing because of the number crunching it has to do, and if we go multi-threaded on the same box there's a good chance we will start affecting the performance of other, more critical systems.
theburningmonk
Oh, and yes, the writes to the DB will be serialised, but that's not a problem for us as most of the time is spent processing the data, so the serialised DB access shouldn't prove a bottleneck.
theburningmonk
You could have a queue of work in the database and a pool of machines picking up the work.
Chris Card
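
For example, each machine in the pool could atomically claim one pending row, using READPAST/UPDLOCK hints so workers skip rows already locked by someone else (a sketch; the WorkQueue table and its columns are made up):

    # Sketch: a worker atomically claims one item from a DB work queue.
    # Table and column names (WorkQueue, Status, Payload) are hypothetical.
    CLAIM_SQL = """
    UPDATE TOP (1) WorkQueue WITH (READPAST, UPDLOCK, ROWLOCK)
    SET Status = 'InProgress', ClaimedAt = SYSUTCDATETIME()
    OUTPUT inserted.Id, inserted.Payload
    WHERE Status = 'Pending';
    """

    def claim_next_item(conn):
        cur = conn.cursor()
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()  # None when the queue is empty
        conn.commit()
        return row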