I have a bunch of files (on the order of 10 per second) coming into a system (stored into a database). Each file contains an entry for somewhere between 1 and 500 devices. A given device will appear in multiple files (but not every file). This data eventually needs to be stored in another database, stored per-device. There are two different file formats.

An API handles the final database step. It accepts several entries for a single device in one call. Behind the scenes it also does some lookups to find IDs in the database, so processing multiple entries for one device at once means doing those lookups once instead of once per entry.

To do this, I have a program with several parts:

  • Parse files, extracting data into a common set of data objects.
    • This is a threaded process, with one thread per file, adding data into a thread-safe collection
    • As each file is loaded, its DB entry is marked as being 'in progress'
  • Save objects into database
    • Another threaded process, which extracts all the objects for a given device, and then tells the data API to save them.
    • Once the saves for all devices from a single file have succeeded (or any of them has failed), the DB entry for the original file is marked as succeeded or failed.

My question is: what is the best way to decide when to parse files, how many threads to use, how much RAM to allow, and so on?

  • The data API is going to take the longest - most of the time, the threads there will just be waiting for the API to return.
  • The overall efficiency of the system is improved by having more data grouped per device
  • The application shouldn't run out of RAM, or have so many files parsed but waiting to be saved that it causes the OS to swap.
  • It's unknown how many simultaneous calls the DB API can handle, or how quickly it runs - this process needs to adapt to that

So how do I know when to parse files to make sure that this is going as fast as it can, without causing a performance hit by using too much RAM?

A: 

This is how I would do it. As each new file comes in, add it to a queue. Have a dispatcher pick up a file and start a new thread.

The dispatcher can continuously monitor available system memory and CPU usage (using, for example, the performance counter API).

As long as there is enough free memory or low enough cpu load, launch a new thread. You would have to test a bit to find the optimal thresholds for your application.

Also, if you are running as a 32-bit process, the address space is limited. In practice you may only be able to use around 800 MB of RAM before you get an out-of-memory exception, so you might need to take that into account as well.

Your third factor for starting new work is the DB API. As long as it can swallow your added work, keep on adding more threads.

The flow of the program would be something like this:

  1. Consume and parse files
  2. When you reach your memory limit (and/or CPU limit), batch the parsed data to the DB API
  3. As you batch to the DB API, memory is released and new files can be processed; go back to step 1
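
The loop above can be sketched in Python as follows. This is only a minimal, single-threaded illustration under stated assumptions: the number of buffered entries stands in for a real memory measurement, and `process_with_budget`, `save_batch`, and `max_buffered_entries` are hypothetical names, not part of any existing API.

```python
from collections import defaultdict


def flush(buffered, save_batch):
    """Send every buffered device batch to the DB API, then release the memory."""
    for device_id, entries in buffered.items():
        save_batch(device_id, entries)
    buffered.clear()


def process_with_budget(files, save_batch, max_buffered_entries):
    """Parse files until the buffer limit is reached, then batch to the DB API.

    `files` is an iterable of already-parsed files, each a list of
    (device_id, entry) pairs. The entry count is an assumed stand-in for a
    real memory or CPU measurement.
    """
    buffered = defaultdict(list)   # device_id -> entries waiting to be saved
    count = 0
    for file_entries in files:                      # step 1: consume and parse
        for device_id, entry in file_entries:
            buffered[device_id].append(entry)
            count += 1
        if count >= max_buffered_entries:           # step 2: limit reached
            flush(buffered, save_batch)
            count = 0                               # step 3: memory released
    flush(buffered, save_batch)                     # save whatever is left
```

A larger `max_buffered_entries` groups more entries per device into each call, at the cost of more memory and a longer delay before data reaches the database.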
Mikael Svenson
+1  A: 

It seems like you have a system that is very much I/O-bound (files on the input side and the DB on the output side). I don't see any CPU-intensive parts in there.

The obvious optimization is already in the question: batch a large number of incoming files and group the data per device. The cost is memory consumption and latency in DB updates. You'll need tunable parameters to control that trade-off.

As a first idea, I would set it up as 3 blocks connected by bounded queues. Those queues let any component that is 'overwhelmed' throttle its suppliers, because a supplier blocks when the queue it writes to is full.

Block 1: 1 or 2 threads (depending on the I/O system) to read and parse files.

Block 2: 1 thread to organize and group data, and to decide when device data should go to the DB.

Block 3: 1+ threads pushing data to the DB.

The blocks give this system some flexibility. The bounded queues let you control resource consumption. Note that block 2 should be parameterized so that the size of the batches it emits can be tuned.
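
A minimal Python sketch of this three-block layout, using the standard library's bounded `queue.Queue`. The function names, the `save_batch` callback, and the "emit a batch once a device has `batch_size` entries" rule are all assumptions made for illustration; a real block 2 would probably also flush on a timer.

```python
import queue
import threading
from collections import defaultdict

SENTINEL = None  # marks the end of the stream on each queue


def parse_stage(files, parsed_q):
    """Block 1: turn each file into (device_id, entry) pairs."""
    for file_entries in files:
        for device_id, entry in file_entries:
            parsed_q.put((device_id, entry))   # blocks while the queue is full
    parsed_q.put(SENTINEL)


def group_stage(parsed_q, batch_q, batch_size):
    """Block 2: group entries per device and decide when a batch is ready."""
    pending = defaultdict(list)
    while (item := parsed_q.get()) is not SENTINEL:
        device_id, entry = item
        pending[device_id].append(entry)
        if len(pending[device_id]) >= batch_size:
            batch_q.put((device_id, pending.pop(device_id)))
    for device_id, entries in pending.items():  # flush incomplete batches
        batch_q.put((device_id, entries))
    batch_q.put(SENTINEL)


def save_stage(batch_q, save_batch):
    """Block 3: push ready batches to the DB API."""
    while (batch := batch_q.get()) is not SENTINEL:
        save_batch(*batch)
    batch_q.put(SENTINEL)  # pass the end marker on to the other saver threads


def run_pipeline(files, save_batch, batch_size=3, queue_size=100, savers=2):
    parsed_q = queue.Queue(maxsize=queue_size)  # bounded: throttles block 1
    batch_q = queue.Queue(maxsize=queue_size)   # bounded: throttles block 2
    threads = [
        threading.Thread(target=parse_stage, args=(files, parsed_q)),
        threading.Thread(target=group_stage, args=(parsed_q, batch_q, batch_size)),
    ]
    threads += [
        threading.Thread(target=save_stage, args=(batch_q, save_batch))
        for _ in range(savers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If the DB API is slow, `batch_q` fills up, block 2 blocks on `put`, `parsed_q` fills up in turn, and parsing pauses. Memory use stays bounded by the two queue sizes plus whatever block 2 is holding, and `savers`, `batch_size`, and `queue_size` are the knobs to experiment with.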

Henk Holterman
I like this direction as it keeps the interaction between threads to a minimum. Depending on how it is implemented, you can easily run tests to see how adding threads to each block improves or degrades performance.
ChaosPandion