I have a project where I need to pull in a lot of records and modify them based on some criteria.

Basically, we have a system where users can upload documents. Those documents get tagged in the database for validation by other users. We validate the files based on certain criteria and then mark them as valid. So we have two columns, isValid and validated.

I can't depend on the database to validate the files, so I have an application that does the validation work. There could potentially be hundreds of thousands of files to validate. What is the best approach for the application to iterate over the records? One thought I had was to write a stored procedure (SP) to pull the TOP X records that do not have the validated flag set to true, then run another query to see whether there are still records left. If so, run the same SP again, pull the records, and process them. I am not sure how the application would handle that volume of records.
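
For illustration, here is roughly the shape of the SP I had in mind (a sketch only; the table and column names are placeholders):

    -- Sketch: dbo.Documents, DocumentId, FilePath are placeholder names
    CREATE PROCEDURE dbo.GetUnvalidatedDocuments
        @BatchSize INT
    AS
    BEGIN
        SET NOCOUNT ON;
        -- pull the next batch of records not yet validated
        SELECT TOP (@BatchSize) DocumentId, FilePath
        FROM dbo.Documents
        WHERE validated = 0
        ORDER BY DocumentId;
    END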

A: 

I would start by looking at BulkRead and BulkWrite against the db. I haven't personally had a reason to use them, but I believe they're pretty close to exactly what you need: a very fast way to pull data from the db, and an equally fast way to write it back.

AllenG
Pretty much exactly NOT what the user needs. Congratulations.
TomTom
+1  A: 

Have you tried using FILESTREAM columns in SQL Server? If not, here is a brief description.

Essentially, this way your documents could be physically stored in the file system yet still be treated by SQL Server as an integral part of your DB. That means you would not have to update records with large BLOB columns, and you could use direct file system calls to manage the documents themselves.
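
For example, the table definition might look roughly like this (a sketch; the database first needs a FILESTREAM filegroup, and all names here are illustrative):

    -- Sketch: requires a FILESTREAM filegroup on the database;
    -- FILESTREAM also needs a ROWGUIDCOL column with a UNIQUE constraint
    CREATE TABLE dbo.Documents
    (
        DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Name       NVARCHAR(260) NOT NULL,
        Content    VARBINARY(MAX) FILESTREAM NULL
    );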

Just a thought.

Alan
+2  A: 

Your approach is pretty sound. I have used a similar approach, for example for mass mailings (read the top 1,000, repeat until you run out of records). The good thing is that you never have to pull in more than X records at a time, which keeps your loops nice and fast.
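
The shape of that loop is roughly this (a T-SQL sketch with placeholder names; the actual validation happens in your application between the read and the write):

    -- Sketch: the app validates each batch, then marks the rows
    -- so the next iteration skips them
    DECLARE @BatchSize INT = 1000;
    WHILE 1 = 1
    BEGIN
        SELECT TOP (@BatchSize) DocumentId, FilePath
        FROM dbo.Documents
        WHERE validated = 0
        ORDER BY DocumentId;

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left to process

        -- application validates the batch here, then:
        -- UPDATE dbo.Documents SET validated = 1, isValid = ... WHERE DocumentId IN (...);
    END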

If it does not work out, you could add Service Broker and put in a QUEUE of validation orders that worker processes listen to. This latter approach also makes it easy to have multiple readers doing the validation. It only makes sense, though, if validation is the bottleneck because it takes time (you never say what validating actually DOES).
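
A minimal Service Broker setup looks roughly like this (a sketch only; every name here is illustrative):

    -- Sketch: illustrative names throughout
    CREATE MESSAGE TYPE ValidationRequest VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT ValidationContract (ValidationRequest SENT BY INITIATOR);
    CREATE QUEUE ValidationQueue;
    CREATE SERVICE ValidationService ON QUEUE ValidationQueue (ValidationContract);

    -- enqueue a validation order
    DECLARE @dialog UNIQUEIDENTIFIER;
    DECLARE @msg XML = N'<doc id="123" />';
    BEGIN DIALOG CONVERSATION @dialog
        FROM SERVICE ValidationService
        TO SERVICE 'ValidationService'
        ON CONTRACT ValidationContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @dialog
        MESSAGE TYPE ValidationRequest (@msg);

    -- each worker process dequeues; RECEIVE is safe for concurrent readers
    DECLARE @body XML;
    WAITFOR (
        RECEIVE TOP (1) @body = CAST(message_body AS XML)
        FROM ValidationQueue
    ), TIMEOUT 5000;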

TomTom
Processing like this is basically a queue. If high performance/throughput is needed, the table has to be designed with queueing in mind: http://rusanu.com/2010/03/26/using-tables-as-queues/
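The heart of that pattern is a destructive dequeue, roughly like this (a sketch; table and column names are placeholders):

    -- READPAST lets concurrent readers skip rows that another reader
    -- has locked, so many workers can dequeue in parallel
    WITH NextBatch AS (
        SELECT TOP (100) DocumentId, FilePath
        FROM dbo.DocumentQueue WITH (ROWLOCK, READPAST)
        ORDER BY DocumentId
    )
    DELETE FROM NextBatch
    OUTPUT deleted.DocumentId, deleted.FilePath;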
Remus Rusanu
The problem with this is that without an update you cannot easily put in multiple readers. So it is fine for something that does not have a high processing requirement, but running 300 processes pulling the newest items is a little more complicated than just a SELECT TOP.
TomTom