I have a system that breaks a large task into small tasks, running about 30 threads at a time. As each thread finishes, it persists its calculated results to the database. What I want is for each thread to pass its results to a new persistence class that runs in its own thread and performs a kind of double buffering before persisting the data.

For example, after 100 threads have moved their data into the buffer, the persistence class swaps the buffers and persists all 100 entries to the database. This would allow the use of prepared statements and thus greatly reduce the I/O between the program and the database.

Is there a pattern or a good example of this kind of multithreaded double buffering?

+3  A: 

I've seen this pattern referred to as asynchronous database writing or the write-behind pattern. It's a typical pattern supported by the distributed cache products (Terracotta, Coherence, GigaSpaces, ...) because you don't want your cache updates to also include writing the change to the underlying database.

The complexity of this pattern depends on your tolerance for lost database updates. Because of the delay between completing the work and writing the result to the database, you can lose the updates due to bugs, power failures, ... (you get the picture).

I'd suggest some sort of queue for the completed results to be written to the DB and then process them in batches of 100 (using your example) OR after an amount of time. The reason for also using a time delay is to cope with result sets that aren't divisible by 100.

If you have no requirements for resilience/durability, then you can do all this in the same process. If, however, you can't tolerate any loss, then you can replace the in-vm queue with a persistent JMS queue (slower but safer).
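
To make the "batches of 100 OR after an amount of time" idea concrete, here is a minimal in-process sketch. Every name in it (BatchingWriter, submit, shutdown, the sink callback) is hypothetical, and the sink just stands in for whatever actually writes to the database. It assumes producers stop submitting before shutdown() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: flushes when batchSize items are collected or maxWaitMillis has elapsed.
class BatchingWriter<T> implements Runnable {
    // Bounded, so producers block instead of exhausting memory (capacity is an arbitrary choice).
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(1000);
    private final int batchSize;
    private final long maxWaitNanos;
    private final Consumer<List<T>> sink; // stands in for the real database write
    private volatile boolean running = true;

    BatchingWriter(int batchSize, long maxWaitMillis, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.maxWaitNanos = TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        this.sink = sink;
    }

    /** Called by worker threads; blocks while the queue is full. */
    void submit(T result) throws InterruptedException {
        queue.put(result);
    }

    /** Asks the writer to drain what is left and exit. */
    void shutdown() {
        running = false;
    }

    @Override
    public void run() {
        List<T> batch = new ArrayList<>(batchSize);
        long deadline = System.nanoTime() + maxWaitNanos;
        try {
            while (running || !queue.isEmpty()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                T item = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (item != null) {
                    batch.add(item);
                }
                boolean expired = System.nanoTime() - deadline >= 0;
                if (batch.size() >= batchSize || expired) {
                    flush(batch);
                    deadline = System.nanoTime() + maxWaitNanos;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            flush(batch); // write the final partial batch
        }
    }

    private void flush(List<T> batch) {
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // hand the sink its own copy
            batch.clear();
        }
    }
}
```

The time limit is what picks up the leftover results when the total isn't a multiple of the batch size. Swapping the in-memory queue for a persistent one changes where submit() and poll() go, not the batching logic.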

hbunny
+1 for flagging possible issues.
Romain Hippeau
It's an overnight batch process, so if there were enough memory it would be fine for the process to wait until the very end to write all the generated data to the database. There isn't enough memory to wait until the end, so I'm planning to set it up so that it persists to the DB after a certain number of threads have passed in their data.
Winter
+1  A: 

In order to have lower synchronization overhead, use a thread local (for each compute thread) to build up batches of results. Once some number of results is reached, enqueue the batch to a blocking queue. Use an ArrayBlockingQueue to back your persistence class, since you probably don't want your memory usage to become unbounded. You can have multiple database writer threads taking groups of results and saving them to the database.

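import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Result, beginTransaction(), batchInsert() and endTransaction() are placeholders
// for your own result type and database code.
class WriteBehindPersister {
 // Number of results a compute thread buffers before handing off a batch.
 private static final int MAX_BATCH = 100;
 // Bounded queue: compute threads block in put() when the writer falls behind.
 private static final BlockingQueue<List<Result>> persistQueue = new ArrayBlockingQueue<>(10);
 static {
   new WriteThread().start();
 }

 // Each compute thread lazily gets its own buffer, so add() needs no locking.
 private final ThreadLocal<List<Result>> internalBuffer = ThreadLocal.withInitial(ArrayList::new);

 public void persist(Result r) throws InterruptedException {
  List<Result> localResult = internalBuffer.get();
  localResult.add(r);
  if (localResult.size() >= MAX_BATCH) {
   // Enqueue a copy so the thread-local list can be reused.
   persistQueue.put(new ArrayList<>(localResult));
   localResult.clear();
  }
 }

 // Static nested class so it can be created from the static initializer.
 static class WriteThread extends Thread {
  @Override
  public void run() {
   try {
    while (true) {
     List<Result> batch = persistQueue.take(); // blocks until a batch arrives
     beginTransaction();
     for (Result r : batch) {
      batchInsert(r);
     }
     endTransaction();
    }
   } catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // interrupted: stop writing
   }
  }
 }

}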

Also, you might use an executor service (instead of a single write thread) to persist multiple batches to the DB simultaneously, at the cost of using more than one DB connection. Make sure to use the JDBC batching API if your driver supports it.
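
For the JDBC batching part, a rough sketch of what the writer could do with one batch is below. The class name, the table and column names, and the long[] row shape are all made up for illustration. addBatch(), executeBatch(), commit() and rollback() are the standard java.sql calls, but whether the driver really sends the batch in one round trip depends on the driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: writes one batch of rows in a single transaction.
class JdbcBatchInserter {
    // Assumed table and columns; replace with your own schema.
    private static final String SQL =
        "INSERT INTO results (task_id, value) VALUES (?, ?)";

    /** Inserts all rows as one JDBC batch and returns how many statements were executed. */
    static int insertBatch(Connection conn, List<long[]> rows) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit per batch instead of one per row
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch(); // queue the row on the client side
            }
            int[] counts = ps.executeBatch(); // send the queued rows
            conn.commit();
            return counts.length;
        } catch (SQLException e) {
            conn.rollback(); // drop the partial batch
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

With an executor service, each task would take its own connection (for example from a pool) and call something like insertBatch on it, since a single Connection shouldn't be shared by concurrent writers.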

Justin
As steven points out, you need to decide how to flush the queue at the end of the computation (or if there have been no requests for a long period of time). It all depends on how 'on-line' you need to be.
Justin
Would each worker thread have its own WriteBehindPersister or would the WriteBehindPersister be a singleton?
Winter
The pattern works as a singleton since each thread has its own ThreadLocal internal buffer. If you don't want to use the threadlocal stuff you can instantiate one Persister per thread (with its own buffer) and replace the static queue with an injected reference to a shared queue.
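
A minimal sketch of that second variant (one persister per worker thread, shared queue passed in) might look like the following. PerThreadPersister and its method names are hypothetical. The explicit flush() is one possible answer to the end-of-computation question: each worker calls it once when it finishes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one instance per worker thread, so the buffer needs no locking.
class PerThreadPersister<T> {
    private final BlockingQueue<List<T>> sharedQueue; // injected, shared by all workers
    private final int maxBatch;
    private List<T> buffer = new ArrayList<>();

    PerThreadPersister(BlockingQueue<List<T>> sharedQueue, int maxBatch) {
        this.sharedQueue = sharedQueue;
        this.maxBatch = maxBatch;
    }

    /** Buffers one result and hands off a full batch when maxBatch is reached. */
    void persist(T result) throws InterruptedException {
        buffer.add(result);
        if (buffer.size() >= maxBatch) {
            flush();
        }
    }

    /** Hands off whatever is buffered; call once when the worker is done. */
    void flush() throws InterruptedException {
        if (!buffer.isEmpty()) {
            sharedQueue.put(buffer);    // blocks if the writer is behind
            buffer = new ArrayList<>(); // start a fresh list instead of copying
        }
    }
}
```

Each worker would create its own instance around the same queue, e.g. new PerThreadPersister<>(queue, 100), and the writer thread(s) take batches from that queue as before.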
Justin