views: 78

answers: 4
I have been working with SQL Server for a while and have used a lot of performance techniques to fine-tune many queries. Most of those queries were expected to finish within a few seconds, or maybe minutes.

I am now working with a job that loads around 100K records and runs for around 10 hours.

What do I need to consider while writing or tuning such a query? (e.g. memory, log size, other things)

+3  A: 

Make sure you have good indexes defined on the columns you are querying on.

TLiebe
I have indexes on all the related tables. The queries run fine when executed individually; they only give trouble when executed in bulk.
BinaryHacker
Are the indexes still up to date? Have a look at the index properties in SQL Server Management Studio and check the Fragmentation tab. Try rebuilding the indexes that are too fragmented.
TLiebe
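
A minimal T-SQL sketch of that fragmentation check and rebuild, assuming a hypothetical table dbo.YourTable:

-- Average fragmentation per index on one table
SELECT i.name, ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- REBUILD is heavier but gives a clean result; REORGANIZE is a lighter option for moderate fragmentation
ALTER INDEX ALL ON dbo.YourTable REBUILD;
-- ALTER INDEX ALL ON dbo.YourTable REORGANIZE;
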
Thanks TLiebe. I did check that, and will check it again. But right now the only thing bugging me is: what's the difference between processing 10K records and processing 100K? If the indexes were out of date, I should be getting bad performance when processing 10K records as well.
BinaryHacker
Try having a look at the execution plan of the query. Does something change when you run 10K records vs. 100K? Is one step suddenly taking a much higher percentage of the time? Compare the 10K run against the 100K run and you might be able to spot what's going wrong with the 100K run and focus your efforts on fixing that.
TLiebe
The execution plan _is_ going to vary with size of tables, so that's not unexpected.
Cade Roux
I believe the execution plan will give me estimated execution time for the query irrespective of the number of records. How do I compare the 10K run with the 100K run? All I know is that it's not any particular table or query that becomes slow; it's the whole job. Somehow my job is eating some critical system resource, which I am unable to find or prevent.
BinaryHacker
"execution plan will give me estimated execution time", no, it gives a plan and relative time of components - relative time will often not change with scale unless the plan changes altogether. Scale CAN change the plan completely. "my job is eating some critical system resources" you may have hit some kind of memory threshold - this should be more clear from looking at the plan - bad joins, cross joins can all balloon at certain thresholds.
Cade Roux
@Cade: Thanks for correcting me there. Does the execution plan work across levels of procedure execution? I have a top-level procedure calling multiple procedures. Also, I can run the estimated execution plan, but given the time and load requirements of this job, I may not be able to run it again with the actual execution plan enabled without first making some performance changes or knowing for sure. I will recheck all the joins as you suggested.
BinaryHacker
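
One low-overhead way to compare a 10K run against a 100K run is to capture I/O and timing statistics for each; a minimal sketch, assuming a hypothetical top-level procedure dbo.LoadJob:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.LoadJob;   -- run once with the 10K input, once with the 100K input, and diff the Messages output

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
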
+1  A: 

Ultimately, the best thing to do is to actually measure and find the source of your bottlenecks. Figure out which queries in a stored procedure, or which operations in your code, take the longest, and focus on slimming those down first.
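
On SQL Server 2005 and later, one way to do that measurement across nested procedures is to query the plan-cache DMVs; a rough sketch (the column choices are just illustrative):

-- Top statements by total elapsed time since they entered the plan cache
SELECT TOP (20)
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.execution_count,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
              (CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
               END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;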

I am actually working on a similar problem right now, on a job that performs complex business logic in Java for a large number of database records. I've found that the key is to process records in batches, and make as much of the logic as possible operate on a batch instead of operating on a single record. This minimizes roundtrips to the database, and causes certain queries to be much more efficient than when I run them for one record at a time. Limiting the batch size prevents the server from running out of memory when working on the Java side. Since I am using Hibernate, I also call session.clear() after every batch, to prevent the session from keeping copies of objects I no longer need from previous batches.
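
The same batching idea can also be expressed purely in T-SQL when the logic doesn't need the application layer; a hypothetical sketch (the staging table and flag column are made up):

DECLARE @batch INT;
SET @batch = 10000;

WHILE 1 = 1
BEGIN
    -- touch at most @batch rows per iteration to keep each transaction and the log growth small
    UPDATE TOP (@batch) dbo.StagingRows
    SET    processed = 1
    WHERE  processed = 0;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to process
END;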

Also, an RDBMS is optimized for working with large sets of data; use normal set-based SQL operations whenever possible. Avoid things like cursors and a lot of procedural programming, and, as other people have said, make sure you have your indexes set up correctly.
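
As a quick illustration of the set-based point (hypothetical table and column names): instead of walking rows one at a time with a cursor, let a single statement handle the whole set.

-- Cursor / row-by-row version (slow): DECLARE cur CURSOR FOR SELECT id FROM dbo.Orders WHERE status = 'NEW'; ...

-- Set-based version (preferred): one statement updates every qualifying row at once
UPDATE o
SET    o.status = 'PROCESSED'
FROM   dbo.Orders AS o
WHERE  o.status = 'NEW';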

RMorrisey
Two more suggestions: look at the query execution plan in SQL Server Management Studio, and look for table scans that you can eliminate with proper indexing, making sure your queries are sargable. If you are working with very large tables, try defragmenting your indexes. See: http://updates.sqlservervideos.com/2009/09/power-up-with-sql-server-sql-server-performance.html
RMorrisey
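
A quick illustration of the sargability point, assuming a hypothetical OrderDate column: wrapping the column in a function prevents an index seek, while an open-ended range allows one.

-- Not sargable: the function on the column forces a scan
SELECT * FROM dbo.Orders WHERE YEAR(OrderDate) = 2009;

-- Sargable: the bare column can use an index on OrderDate
SELECT * FROM dbo.Orders WHERE OrderDate >= '20090101' AND OrderDate < '20100101';
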
+1  A: 

It's impossible to say without looking at the query. Just because you have indexes doesn't mean they are being used; you'll have to look at the execution plan and see whether they are. The plan may show that the optimizer doesn't consider them useful.

You can start by looking at the estimated execution plan. If the job actually completes, you can wait for the actual execution plan. Also look into parameter sniffing. And I had an extremely odd case on SQL Server 2005 where

SELECT * FROM l LEFT JOIN r ON r.ID = l.ID WHERE r.ID IS NULL

would not complete, yet

SELECT * FROM l WHERE l.ID NOT IN (SELECT r.ID FROM r)

worked fine - but only for particular tables. Problem was never resolved.
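
Not something from the case above, but a third formulation of the same anti-join that is often worth trying, and which (unlike NOT IN) is not tripped up by NULLs in r.ID:

SELECT * FROM l WHERE NOT EXISTS (SELECT 1 FROM r WHERE r.ID = l.ID)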

Make sure your statistics are up to date.

Cade Roux
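
A minimal sketch of refreshing statistics; the table name is hypothetical, and sp_updatestats sweeps the whole database:

UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;  -- one table, full scan for better accuracy
EXEC sp_updatestats;                            -- or refresh everything that has changed
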
A: 

If possible, post your query here so there is something to look at. I recall a query someone built with joins to 12 different tables, dealing with around 4 million records, that took about a day to run. I was able to tune it to run within 30 minutes by eliminating the unnecessary joins. Where possible, try to reduce the datasets you are joining before returning your results. Use temp tables, views, etc. where you need them.

In the case of large datasets with conditions, try to pre-apply your conditions through a view (or temp table) before your joins to reduce the number of records involved; 100K joining 100K is a lot bigger than 2K joining 3K.
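
A hypothetical sketch of that pre-filtering idea, using made-up table names: restrict each side into a temp table first, then join the much smaller sets.

-- Apply the restrictive conditions before the join
SELECT customer_id, amount
INTO   #recent_orders
FROM   dbo.Orders
WHERE  order_date >= '20090101';

SELECT id, region
INTO   #active_customers
FROM   dbo.Customers
WHERE  is_active = 1;

-- The join now touches only the reduced sets
SELECT o.customer_id, o.amount, c.region
FROM   #recent_orders    AS o
JOIN   #active_customers AS c ON c.id = o.customer_id;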