views:

36

answers:

1

I have about 60 million records in a database and have to process all of them. The idea is to use C# code to read the data, process it, and then write it back to the database. The data doesn't come from and go to the same table - multiple tables are involved.

I want to know the best way to go about this. Should I read 100K records at a time into a DataSet, process each record, bulk insert the results into the database, and then read the next set?

Any help would be appreciated.

thanks

+1  A: 

Typically the absolute fastest way is to do everything on the server in SQL batches.

If you insist on using a client, then separate threads to read and write can be faster than using one thread to do both. How many threads to use for reading and writing will depend on your hardware and what you're doing.

EDIT: Clarifying the approach.

Retrieving data from and sending it to the SQL server is both network-IO bound and out of process. This means that on both the read and the write your application spends time waiting for data to travel from disk, over the network, and into memory. Let's assume it takes 1 hour to retrieve the data, 10 minutes to process it, and 1 hour to send it back to the db. Then your entire process would take 2 hours and 10 minutes.

If you split it up into three threads (one reader, one processor, one updater), you can get it down to close to 1 hour. If you write your application well, you can add additional threads for reading, processing, and writing, but you might be disappointed by the results because of things like sharing of cache lines, how the network card responds to lots of concurrent requests, etc.
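The three-thread split described above can be sketched as a bounded producer/consumer pipeline using `BlockingCollection<T>`. This is only an illustration, not the answerer's code: the `Row` record, the row count, and the queue capacities are made up, and in the real application the reader and writer stages would talk to SQL Server instead of generating strings.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for one database record.
public record Row(int Id, string Payload);

public static class Pipeline
{
    public static List<string> Run(int rowCount)
    {
        // Bounded queues so a fast reader can't run ahead and exhaust memory.
        var toProcess = new BlockingCollection<Row>(boundedCapacity: 10_000);
        var toWrite   = new BlockingCollection<Row>(boundedCapacity: 10_000);
        var written   = new List<string>();

        var reader = Task.Run(() =>
        {
            // Real app: stream rows from a SqlDataReader here.
            for (int i = 0; i < rowCount; i++)
                toProcess.Add(new Row(i, "raw-" + i));
            toProcess.CompleteAdding(); // signal the processor that no more rows are coming
        });

        var processor = Task.Run(() =>
        {
            foreach (var row in toProcess.GetConsumingEnumerable())
                toWrite.Add(row with { Payload = row.Payload.ToUpperInvariant() });
            toWrite.CompleteAdding();
        });

        var writer = Task.Run(() =>
        {
            // Real app: batch rows and hand them to SqlBulkCopy here.
            foreach (var row in toWrite.GetConsumingEnumerable())
                written.Add(row.Payload);
        });

        Task.WaitAll(reader, processor, writer);
        return written;
    }

    public static void Main() => Console.WriteLine(Pipeline.Run(100).Count);
}
```

Each stage overlaps its waiting with the others' work, which is where the "2 hours 10 minutes down to about 1 hour" estimate comes from.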

Also, when you use a DataAdapter to fill a DataSet, you can't touch any of the data until the fill is complete. With a DataReader, on the other hand, you can start using the data as soon as the first row arrives. This means you don't have to worry about limiting yourself to 100K rows at a time.
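A minimal sketch of that DataReader approach, combined with batched SqlBulkCopy writes, might look like the following. The connection strings, table names, column layout, batch size, and the `Process` placeholder (standing in for the third-party call mentioned below) are all hypothetical; it needs a reachable SQL Server to actually run.

```csharp
using System.Data;
using System.Data.SqlClient;

class StreamAndBulkInsert
{
    // Hypothetical connection strings and tables; adjust to your environment.
    const string SourceConnStr = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
    const string DestConnStr   = SourceConnStr;

    static void Main()
    {
        var buffer = new DataTable();
        buffer.Columns.Add("Id", typeof(int));
        buffer.Columns.Add("Payload", typeof(string));

        using (var src = new SqlConnection(SourceConnStr))
        using (var cmd = new SqlCommand("SELECT Id, Payload FROM dbo.SourceTable", src))
        {
            src.Open();
            using (var reader = cmd.ExecuteReader())
            {
                // Rows are usable as soon as they arrive; no need to wait for a full Fill().
                while (reader.Read())
                {
                    buffer.Rows.Add(reader.GetInt32(0), Process(reader.GetString(1)));

                    // Flush in fixed-size batches to keep memory usage flat.
                    if (buffer.Rows.Count == 10_000)
                        Flush(buffer);
                }
            }
        }
        if (buffer.Rows.Count > 0)
            Flush(buffer); // write the final partial batch
    }

    // Placeholder for the per-record processing (e.g. a third-party call).
    static string Process(string payload) => payload;

    static void Flush(DataTable batch)
    {
        using (var bulk = new SqlBulkCopy(DestConnStr))
        {
            bulk.DestinationTableName = "dbo.DestTable";
            bulk.WriteToServer(batch);
        }
        batch.Clear();
    }
}
```

The batch size here only controls how often the writer flushes; unlike the DataSet approach, the reader never has to stop and re-query for the next 100K rows.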

Conrad Frix
+1 for separate threads (though having every thread do read/process/write is fine, I think). Also, don't use LINQ to SQL - it really suffers from memory leaks on huge datasets.
Nestor
No, I cannot do everything in SQL Server; I have to call a third party to process the data. So, is reading 100K rows into a DataSet, processing them, putting the data back, and doing all this in 10 threads the fastest way?
motiont.com