views: 436

answers: 4
Dear Boffins

I've got a large conversion job: 299 GB of JPEG images, already in the database, to be converted into thumbnail equivalents for reporting and bandwidth purposes.

I've written a thread-safe SQLCLR function to do the business of re-sampling the images; lovely job.

The problem is that when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), it runs serially to prevent race conditions, using only one processor to resample the images.

So, how would I best utilise the 12 cores and phat RAID setup this database machine has? Is it to use a subquery in the FROM clause of the UPDATE statement? Is that all that's required to enable parallelism on this kind of operation?

Anyway, the operation is split into batches of around 4,000 images each (in a windowed query over about 391k images), and this machine has plenty of resources to burn.
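For context, a minimal sketch of the kind of per-batch UPDATE being described; the names (dbo.Photos, PhotoID, PhotoData, ThumbData, and the SQLCLR function dbo.ResampleImage) are assumptions, since the real schema isn't shown:

    DECLARE @BatchStart int;
    SET @BatchStart = 1;  -- starting PhotoID for this batch window

    -- One serial batch: re-sample ~4000 images through the SQLCLR function.
    UPDATE dbo.Photos
    SET    ThumbData = dbo.ResampleImage(PhotoData)
    WHERE  PhotoID BETWEEN @BatchStart AND @BatchStart + 3999
      AND  ThumbData IS NULL;  -- skip rows already converted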

+2  A: 

Hi

Please check the configuration setting for Maximum Degree of Parallelism (MAXDOP) on your SQL Server. You can also adjust the value of MAXDOP if needed.

This link might be useful to you: http://www.mssqltips.com/tip.asp?tip=1047
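For reference, a rough sketch of how that setting can be inspected and changed with sp_configure (it is instance-wide; 0 means use all available processors):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism';      -- show the current value
    EXEC sp_configure 'max degree of parallelism', 0;   -- 0 = use all available processors
    RECONFIGURE;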

cheers

Andriyev
Thanks, the server is configured for maximum possible parallelism.
Rabid
+2  A: 

Could you not split the query into batches and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it; you can stop it, or even encourage it (a little) by changing the 'cost threshold for parallelism' option to 0, but I think it's pretty hit and miss.

One thing that's worth noting is that it will only decide whether or not to use parallelism at the time the query is compiled. Also, if the query is compiled at a time when CPU load is higher, SQL Server is less likely to consider parallelism.
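In case it helps, roughly how that option is changed (it is server-wide, so worth resetting afterwards; the value here is just an illustration, not a recommendation):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Lower the cost threshold so cheaper plans become eligible for parallelism.
    EXEC sp_configure 'cost threshold for parallelism', 0;
    RECONFIGURE;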

Kragen
I have come across this solution, but it's ultimately impractical for the current scenario; this is a one-off and is executed through Management Studio.
Rabid
+2  A: 

Find some criteria to break the set into distinct sub-sets of rows (1-100, 101-200, whatever) and then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.
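A minimal sketch of that idea, assuming a Photos table keyed by PhotoID and a SQLCLR function named dbo.ResampleImage (names guessed, not from the question): each connection runs the same statement with its own ID range.

    -- Connection 1:
    UPDATE dbo.Photos
    SET    ThumbData = dbo.ResampleImage(PhotoData)
    WHERE  PhotoID BETWEEN 1 AND 100000;

    -- Connection 2 (and so on for the remaining ranges):
    UPDATE dbo.Photos
    SET    ThumbData = dbo.ResampleImage(PhotoData)
    WHERE  PhotoID BETWEEN 100001 AND 200000;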

onupdatecascade
+1  A: 

I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.

The bit I've done in the past on similar tasks is to set up a table listing each batch of work to be done. For each connection you fire up, it goes to this table, gets the next batch, marks it as being processed, processes it, updates it as Done, and repeats. This allows you to gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
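A rough sketch of that control-table pattern, with hypothetical names (dbo.BatchQueue, dbo.Photos, dbo.ResampleImage) and the empty-queue check omitted for brevity; UPDLOCK + READPAST keeps two connections from claiming the same batch:

    -- Control table: one row per batch of work.
    CREATE TABLE dbo.BatchQueue
    (
        BatchID    int IDENTITY(1,1) PRIMARY KEY,
        RangeStart int NOT NULL,
        RangeEnd   int NOT NULL,
        Status     varchar(10) NOT NULL DEFAULT 'Pending'  -- Pending / Processing / Done
    );

    -- Each connection repeats this block until no 'Pending' batches remain.
    DECLARE @BatchID int, @RangeStart int, @RangeEnd int;

    BEGIN TRAN;
        -- READPAST skips rows another worker has locked, so a batch is never claimed twice.
        SELECT TOP (1)
               @BatchID = BatchID, @RangeStart = RangeStart, @RangeEnd = RangeEnd
        FROM   dbo.BatchQueue WITH (UPDLOCK, READPAST, ROWLOCK)
        WHERE  Status = 'Pending'
        ORDER BY BatchID;

        UPDATE dbo.BatchQueue SET Status = 'Processing' WHERE BatchID = @BatchID;
    COMMIT;

    -- Do the real work for this range, then mark the batch as done.
    UPDATE dbo.Photos
    SET    ThumbData = dbo.ResampleImage(PhotoData)
    WHERE  PhotoID BETWEEN @RangeStart AND @RangeEnd;

    UPDATE dbo.BatchQueue SET Status = 'Done' WHERE BatchID = @BatchID;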

Philip Kelley
That is a good technique. I've done it with a global temp table (plus care in using transactions to fetch the rows out of the table, so one thread can't pick up a row already being worked on by another thread)
onupdatecascade
I've tried scouring for references on SQLCLR functions operating within a parallel environment and have found nothing myself. I would like to imagine it's a safe bet and that it 'just works'. Definitely a nice approach.
Rabid
I tried my subquery, but it didn't seem to operate in parallel. Your solution is the most elegant, though; I've given points to the others for their thoughts. I've already converted the data once (it took 26 hours); I thought I was going to have to convert it again because of a false alarm during testing.
Rabid