views: 213
answers: 2
Hi all,

I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.

The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:

  1. Buffer the pending data myself until the existing command is finished.
  2. Open multiple connections as needed.

I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.

Any help would be appreciated.

+3  A: 

Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". A good strategy/pattern to use here is a "thread pool": each of N dedicated threads holds its own connection and picks up the next write request as soon as it finishes the previous one. The number of threads that gives the best performance is best determined empirically, by benchmarking various pool sizes in a realistic experimental/prototype setting.
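A minimal, language-agnostic sketch of that pool (in Python here; the same shape applies with ADO.NET connections). `connect` and `execute` are placeholders for whatever opens a connection and runs your INSERT:

```python
import queue
import threading

POOL_SIZE = 4            # tune empirically by benchmarking
QUEUE_LIMIT = 10_000     # finite buffer between producer and writers

work_queue = queue.Queue(maxsize=QUEUE_LIMIT)
_SENTINEL = object()     # tells a worker to shut down

def writer(connect, execute):
    """Each worker holds one connection for its whole lifetime."""
    conn = connect()                      # one connection per thread
    while True:
        record = work_queue.get()
        if record is _SENTINEL:
            break
        execute(conn, record)             # synchronous write on this thread

def start_pool(connect, execute):
    threads = [threading.Thread(target=writer, args=(connect, execute))
               for _ in range(POOL_SIZE)]
    for t in threads:
        t.start()
    return threads

def stop_pool(threads):
    for _ in threads:
        work_queue.put(_SENTINEL)         # one sentinel per worker
    for t in threads:
        t.join()
```

The main thread just calls `work_queue.put(record)`; because the queue is bounded, a sustained overload shows up as the queue filling rather than as unbounded memory growth.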

If the "buffer" queue (into which your main thread puts write requests, and from which the pool's dedicated threads pick them up) grows beyond a certain threshold, you're receiving data faster than you can possibly write it out. Unless you can get more resources, you'll simply have to drop some of the incoming data -- perhaps by a random-sampling strategy, to avoid biasing future statistical analysis. Count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say, every minute), so you can use "stratified sampling" techniques in future data-mining explorations.
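One way to sketch that bounded buffer with random-sampling drops and per-period counts (Python again; `SamplingBuffer` and its method names are my own, not from any library). When the buffer is full, a uniformly random resident record is evicted in favour of the new one, so what survives is closer to an unbiased sample of the stream than just its oldest prefix:

```python
import random
import threading
from collections import deque

class SamplingBuffer:
    """Bounded buffer between the producer and the writer threads,
    with per-period written/dropped counts for later stratified-
    sampling analysis."""

    def __init__(self, maxsize=10_000):
        self.maxsize = maxsize
        self.buf = deque()
        self.lock = threading.Lock()
        self.written = 0
        self.dropped = 0

    def offer(self, record):
        """Producer side: returns False when a record had to be tossed."""
        with self.lock:
            if len(self.buf) < self.maxsize:
                self.buf.append(record)
                return True
            # Full: toss a uniformly random resident record, keep the new one.
            victim = random.randrange(self.maxsize)
            self.buf[victim] = record
            self.dropped += 1
            return False

    def take(self):
        """Writer side: returns the next record, or None if empty."""
        with self.lock:
            if not self.buf:
                return None
            self.written += 1
            return self.buf.popleft()

    def period_counts(self):
        """Snapshot and reset the counters, e.g. once a minute."""
        with self.lock:
            counts = (self.written, self.dropped)
            self.written = self.dropped = 0
        return counts
```

Logging the pair returned by `period_counts()` every minute gives you exactly the written-vs-dropped record Alex describes.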

Alex Martelli
A: 

Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?

(I'm the original poster, I've just managed to get two accounts without realizing)

George Sealy
Yes, if the data's coming in "fast and furious", some of it will need buffering -- but use a buffer of finite size, just as you use a pool with a finite number of threads (one connection per thread is my recommendation). Diagnose when the incoming data is simply too much for you ever to write out, then "subsample": toss away a random subset you can't handle, count how much you're handling and how much you're tossing every minute, and record those two numbers for posterity.
Alex Martelli
The buffer (queue) lets you survive a relatively short spike of incoming data (more than you can write in a given _second_, but something you could eventually catch up on if the data rate slackens a bit in the short-term future).
Alex Martelli