views: 380

answers: 4
I have a SQL query which returns over half a million rows to process... The processing doesn't take really long, but I would like to speed it up a little bit with some multiprocessing. Considering the code below, is it possible to multithread something like that easily?

using (SqlDataReader reader = command.ExecuteReader())
{
    while (reader.Read())
    {
        // ...process row
    }
}

It would be perfect if I could simply get a cursor at the beginning and another in the middle of the list of results. That way, I could have two threads processing the records. However, the SqlDataReader doesn't allow me to do that...

Any idea how I could achieve that?

+3  A: 

Set up a producer/consumer queue, with one producer process that pulls from the reader and queues records as fast as it can, but does no "processing". Then have some other number of processes (how many you want depends on your system) dequeue and process each queued record.
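A minimal sketch of that approach, assuming .NET 4.5 or later (for BlockingCollection<T> and Task.Run); the bounded capacity of 10000 and the use of object[] to hold a row are illustrative choices, and command is the SqlCommand from the question:

// Namespaces needed: System, System.Collections.Concurrent, System.Data.SqlClient,
// System.Linq, System.Threading.Tasks
static void ProcessInParallel(SqlCommand command)
{
    // Bound the queue so the producer cannot race too far ahead of the consumers.
    var queue = new BlockingCollection<object[]>(boundedCapacity: 10000);

    // Producer: drain the reader as fast as possible; no processing here.
    var producer = Task.Run(() =>
    {
        try
        {
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    var values = new object[reader.FieldCount];
                    reader.GetValues(values);   // copy the row out of the reader
                    queue.Add(values);
                }
            }
        }
        finally
        {
            queue.CompleteAdding();             // unblock consumers even if the reader fails
        }
    });

    // Consumers: one per core; each dequeues and processes rows independently.
    var consumers = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(_ => Task.Run(() =>
        {
            foreach (var row in queue.GetConsumingEnumerable())
            {
                // ...process row
            }
        }))
        .ToArray();

    Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());
}

The bound on the queue keeps the producer from filling memory with buffered rows when the consumers can't keep up.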

Joel Coehoorn
Hmm.. odd, but I couldn't find a generic producer/consumer pattern already implemented. Playing with my own now, but input here is appreciated: http://stackoverflow.com/questions/916863/generic-net-produce-consumer
Joel Coehoorn
A: 

Is it a simple ranged query, like WHERE Id BETWEEN 1 AND 500000? If so, you can just kick off N queries that each return 1/N of the range (see the sketch below). But it helps to know where you are bottlenecked with the single-threaded approach.

If you are doing contiguous reads from one disk spindle to fulfill the query, then you should probably stick with a single thread. If the data is partitioned across spindles by some range, then you can intelligently tune your queries to maximize throughput from disk (i.e. read from each disk in parallel with separate queries). If you expect all of the rows to be in memory, then you can parallelize at will.

But if the query is more complex, then you may not be able to partition it easily without incurring a bunch of overhead. Most of the time the options above will not apply well, and the producer/consumer approach that Joel mentioned will be the only place to parallelize. Depending on how much time you spend processing each row, this may provide only trivial gains.
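For the simple ranged case, a rough sketch of kicking off N queries in parallel, one per key range. This assumes .NET 4+ for Parallel.For; the connection string, table name (dbo.MyTable) and Id column are placeholders for your own schema:

// Namespaces needed: System, System.Data.SqlClient, System.Threading.Tasks
static void ProcessByRange(string connectionString, int minId, int maxId, int partitions)
{
    // Rows per partition, rounded up so the last range covers the remainder.
    int chunk = (maxId - minId) / partitions + 1;

    Parallel.For(0, partitions, i =>
    {
        int from = minId + i * chunk;
        int to = Math.Min(from + chunk - 1, maxId);

        // Each partition gets its own connection and command, so the queries run independently.
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT * FROM dbo.MyTable WHERE Id BETWEEN @from AND @to", connection))
        {
            command.Parameters.AddWithValue("@from", from);
            command.Parameters.AddWithValue("@to", to);
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // ...process row
                }
            }
        }
    });
}

As long as the ranges do not overlap, each row is read exactly once, and each query can seek straight to its range if Id is indexed.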

A: 

Since no one is actually going to read that many records manually, why are you pulling them across the network anyway? Can't you use a stored proc to do whatever data manipulation you want directly on the server?

HLGEM
+1  A: 

You shouldn't read that many rows on the client.

That being said, you can partition your query into multiple queries and execute them in parallel. That means launching multiple SqlCommands in separate threads and having each one churn through a partition of the result. The key question is how to partition the result, and this depends largely on your data and your query:

  1. You can use a range of keys (e.g. ID between 1 and 10000, ID between 10001 and 20000, etc.)
  2. You can use an attribute (e.g. RecordTypeID IN (1,2), RecordTypeID IN (3,4), etc.)
  3. You can use a synthetic range (e.g. ROW_NUMBER() BETWEEN 1 AND 1000, etc.), but this is very problematic to pull off right
  4. You can use a hash (e.g. BINARY_CHECKSUM(*) % 10 = 0, BINARY_CHECKSUM(*) % 10 = 1, etc.)

You just have to be very careful that the partition queries do not overlap and block during execution (i.e. scan the same records and acquire X locks), thus serializing each other.
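As an illustration of the hash option (4), a sketch that gives each worker its own bucket, following the same pattern as a ranged partition; the table name, bucket count and connection string are placeholders, and ABS() is used because BINARY_CHECKSUM can return negative values:

// Namespaces needed: System.Data.SqlClient, System.Threading.Tasks
static void ProcessByHashBucket(string connectionString, int buckets)
{
    Parallel.For(0, buckets, bucket =>
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT * FROM dbo.MyTable WHERE ABS(BINARY_CHECKSUM(*)) % @buckets = @bucket",
            connection))
        {
            // Each worker reads only the rows that hash into its bucket.
            command.Parameters.AddWithValue("@buckets", buckets);
            command.Parameters.AddWithValue("@bucket", bucket);
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // ...process row
                }
            }
        }
    });
}

Hash buckets are disjoint by construction, so the partitions cannot overlap; the trade-off is that each bucket query typically scans the whole table, since BINARY_CHECKSUM(*) cannot be satisfied from an index, so a key-range partition is usually cheaper when a suitable index exists.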

Remus Rusanu