views: 1794
answers: 4

My database background is mainly Oracle, but I've recently been helping with some SQL Server work. My group has inherited some SQL Server DTS packages that do daily loads and updates of large amounts of data. Currently this runs on SQL Server 2000, but it will soon be upgraded to SQL Server 2005 or 2008. The mass updates are running too slowly.

One thing I noticed about the code is that some large updates are done in procedural code in loops, so that each statement only updates a small portion of the table in a single transaction. Is this a sound method for doing updates in SQL Server? Locking of concurrent sessions should not be an issue, because user access to the tables is disabled while the bulk loading takes place. I've googled around some and found articles suggesting that doing it this way conserves resources, and that resources are released each time an update commits, leading to greater efficiency. In Oracle this is generally a bad approach; I've used single transactions for very large updates in Oracle with success, and frequent commits there slow the process down and use more resources.

My question is: for mass updates in SQL Server, is it generally good practice to use procedural code and commit many small SQL statements, or to use one big statement to do the whole update?

+1  A: 

In general, I find it better to update in batches - typically in the range of 100 to 1,000 rows. It all depends on how your tables are structured: foreign keys? Triggers? Or are you just updating raw data? You need to experiment to see which scenario works best for you.

If I am in pure SQL, I will do something like this to help manage server resources:

SET ROWCOUNT 1000
WHILE 1=1 BEGIN
    DELETE FROM MyTable WHERE ...
    IF @@ROWCOUNT = 0
        BREAK
END
SET ROWCOUNT 0

In this example, I am purging data. This would only work for an UPDATE if you could restrict or otherwise selectively update rows. (Or only insert xxxx number of rows into an auxiliary table that you can JOIN against.)
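
For example, if there is some way to tell updated rows from pending ones (here a hypothetical Processed flag - adjust it to whatever restriction your tables actually allow), the same pattern works for an UPDATE. A rough sketch, not tested against your schema:

SET ROWCOUNT 1000
WHILE 1=1 BEGIN
    -- Touches at most 1000 rows per pass; the WHERE clause keeps
    -- already-updated rows from being picked up again
    UPDATE MyTable
    SET SomeColumn = SomeColumn + 1, Processed = 1
    WHERE Processed = 0
    IF @@ROWCOUNT = 0
        BREAK
END
SET ROWCOUNT 0

Also note that using SET ROWCOUNT to limit INSERT/UPDATE/DELETE statements is deprecated as of SQL Server 2005, so after the upgrade the same loop would be written with UPDATE TOP (1000) instead.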

But yes, try not to update xx million rows at one time. It takes forever, and if an error occurs, all of those rows will be rolled back (which takes an additional forever).

beach
I had no idea you could do this - genius. Thanks for the tip!
rwmnau
I agree that I'll need to do a lot of experimenting. I'll need to measure both the speed of the updates and impact on other database users (accessing other tables).
RussellH
A: 

Well everything depends.

But ... assuming your database is in single-user mode, or you have exclusive table locks (TABLOCKX) against all the tables involved, batches will probably perform worse, especially if the batches are forcing table scans.
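
To illustrate (the table and column names here are invented), the single-statement alternative with an exclusive table lock would look something like this:

-- One set-based update under an exclusive table lock; since nothing else
-- can touch the table anyway, there is no benefit to splitting it into batches
UPDATE t
SET t.SomeColumn = s.NewValue
FROM MyTable t WITH (TABLOCKX)
    INNER JOIN StagingTable s ON s.KeyCol = t.KeyCol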

The one caveat is that very complex queries will quite often consume resources in tempdb; if tempdb runs out of space (because the execution plan required a nasty, complicated hash join), you are in deep trouble.

Working in batches is a general practice that is quite often used in SQL Server (when it's not in snapshot isolation mode) to increase concurrency and to avoid huge transaction rollbacks caused by deadlocks (you tend to get deadlocks galore when updating a 10-million-row table that is active).

Sam Saffron
I was thinking "it depends" was going to be the answer.
RussellH
A: 

When you move to SQL Server 2005 or 2008, you will need to redo all of those DTS packages in SSIS. I think you will be pleasantly surprised to see how much faster SSIS can be.

In general, in SQL Server 2000, you want to run things in batches of records if the whole set ties up the table for too long. If you are running the packages at night when there is no other use of the system, you may be able to get away with a set-based insert of the entire dataset. Row-by-row processing is always the slowest method, so avoid that if possible as well (especially if all of the row-by-row inserts are in one giant transaction!). If you have 24-hour access with no downtime, you will almost certainly need to run in batches.
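
As a rough sketch of the set-based version (the staging and target table names are made up), the whole load becomes one statement instead of a loop:

-- Inserts every staged row that is not already in the target in a single
-- set-based statement, rather than looping over the staging table row by row
INSERT INTO dbo.TargetTable (KeyCol, SomeColumn)
SELECT s.KeyCol, s.SomeColumn
FROM dbo.StagingTable s
WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable t WHERE t.KeyCol = s.KeyCol)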

HLGEM
"Especially ... one giant transaction!" So it looks like really large transactions are trouble in SQL Server. What about in READ COMMITTED SNAPSHOT mode? Does the transaction still use a lot of locks? Is that mode worth looking ar for large batch transactions?
RussellH
"Especially ... one giant transaction!" So it looks like really large transactions are trouble in SQL Server. What about in READ COMMITTED SNAPSHOT mode? Does the transaction still use a lot of locks? Is that mode worth looking for large batch transactions?
RussellH
+2  A: 

Sorry Guys,

None of the above answers the question; they are just examples of how you can do things. The answer is: more resources get used with frequent commits; however, the transaction log cannot be truncated until a commit point. Thus, if your single spanning transaction is very big, it will cause the transaction log to grow and possibly fragment, which, if undetected, will cause problems later. Also, in a rollback situation, the duration is generally twice as long as the original transaction. So if your transaction fails after half an hour, it will take an hour to roll back, and you can't stop it :-)
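
If you want to watch this happening during a load, a standard DBCC command that makes no assumptions about your schema reports the log usage:

-- Reports log file size and percent of log space used for every database
DBCC SQLPERF(LOGSPACE)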

I have worked with SQL Server 2000/2005, DB2, and ADABAS, and the above is true for all of them. I don't really see how Oracle can work differently.

You could possibly replace the T-SQL with a bcp command; there you can set the batch size without having to code it.
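
For example (the database, table, file, and server names below are placeholders), loading a flat file in character mode (-c) over a trusted connection (-T) and committing a batch every 1,000 rows (-b) would look something like:

bcp MyDB.dbo.MyTable in datafile.txt -c -T -S MyServer -b 1000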

Issuing frequent commits in a single table scan is preferable to running multiple scans with small processing numbers, because, generally, if a table scan is required the whole table will be scanned even if you are only returning a small subset.

Stay away from snapshots. A snapshot will only increase the number of I/Os and compete for I/O and CPU.

Oracle is different in the sense that it uses "undo" and "redo"; "undo" is used for rollback and to allow read-consistent queries.
RussellH