Database : SQL Server 2005

Problem : Copy values from one column to another column in the same table with a billion+ rows.

test_table (int id, bigint bigid)

Things tried 1: update query

update test_table set bigid = id 

fills up the transaction log and rolls back due to lack of transaction log space.

Tried 2 - a procedure along the following lines

set nocount on
set rowcount 500000

declare @rowcount int, @rowsupdated bigint
set @rowcount = 1
set @rowsupdated = 0

while @rowcount > 0
begin
 update test_table set bigid = id where bigid is null
 set @rowcount = @@rowcount
 set @rowsupdated = @rowsupdated + @rowcount
end

print @rowsupdated

The above procedure starts slowing down as it proceeds.

Tried 3 - Creating a cursor for update.

Cursors are generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is far too time-consuming.

Is there an approach that can speed up the copying of values from one column to another? Basically I am looking for some 'magic' keyword or logic that will allow the update to rip through the billion rows half a million at a time, sequentially.

Any hints or pointers will be much appreciated.

+2  A: 

You could try to use something like SET ROWCOUNT and do batch updates:

SET ROWCOUNT 5000;

UPDATE dbo.test_table 
SET bigid = id 
WHERE bigid IS NULL
GO

and then repeat this as many times as you need to.

This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.

Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
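
A minimal sketch of how the repetition could be automated, assuming SET ROWCOUNT still limits UPDATE on your version (it does on SQL Server 2005):

SET ROWCOUNT 5000;

WHILE 1 = 1
BEGIN
    UPDATE dbo.test_table
       SET bigid = id
     WHERE bigid IS NULL;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to update

    -- In FULL or BULK_LOGGED recovery, take a log backup here between
    -- batches so the log space can be reused.
END

SET ROWCOUNT 0;  -- turn the row limit back off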

marc_s
Will rowcount work on an UPDATE command? - http://msdn.microsoft.com/en-us/library/ms188774.aspx
Dan Williams
Yes, it works up to and including SQL 2008, but won't work in SQL vNext. You are correct, though: `UPDATE TOP (5000) ...` is the preferred syntax for SQL 2005 and up. See http://msdn.microsoft.com/en-us/library/ms177523.aspx
BradC
Love the RBAR acronym. I'll have to steal that.
BradC
@BradC : I think it also works in SQL Server 2008 R2 - no??
marc_s
@marc_s Yes, I was including R2 under the umbrella of "2008". They wouldn't deprecate features between major numbered revisions.
BradC
Why would this not work in SQL vNext (2011/2012/whenever)??? Plus vNext will be a major numbered version ... which means, according to your later comment, that this 'feature' wouldn't be deprecated.
Jess
@Jess: See the link in Dan's first comment: "Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in the next release of SQL Server. Do not use SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. Also, for DELETE, INSERT, and UPDATE statements that currently use SET ROWCOUNT, we recommend that you rewrite them to use the TOP syntax."
BradC
@BradC - thanks for the heads-up, that seems really strange that they'd discontinue its usage.
Jess
A: 

I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.

set rowcount 500000

update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where tt1.bigid IS NULL

You can also try changing the recovery model so you don't fully log the transactions:

ALTER DATABASE db1
SET RECOVERY SIMPLE
GO

update test_table
set bigid = id
GO

ALTER DATABASE db1
SET RECOVERY FULL
GO
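
A quick way to confirm the recovery model before and after (a minimal check; db1 is the example database name used above):

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'db1';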
Dan Williams
Wouldn't recommend the subquery with the join. Too much unnecessary work on this huge a table when the alternative (`bigid = id`) is so simple.
BradC
stupid me :) Thanks for the catch.
Dan Williams
"The setting of the SET ROWCOUNT option is ignored for INSERT, UPDATE, and DELETE statements against remote tables and local and remote partitioned views." from MSDN. I don't think the rowcount will work with the update command, so the select is required to make the rowcount work.
Dan Williams
We aren't dealing with remote tables or partitioned views, so that warning doesn't apply. But yes, `UPDATE TOP (x)` is probably preferred vs `SET ROWCOUNT`
BradC
+2  A: 

Is this a one time thing? If so, just do it by ranges:

declare @counter bigint
set @counter = 500000

while @counter < 2000000000 --or whatever your max id
begin
 update test_table set bigid = id where id between (@counter - 500000) and @counter and bigid is null
 set @counter = @counter + 500000
end
Jess
Will try this out too. Whichever is more efficient, 'where bigid is null' or 'where id between x and y', wins.
Adi Pandit
+1 to this (and see my similar answer). This takes advantage of the existing PK on ID, and as such shouldn't slow down as it progresses. (as long as you get rid of the `BIGID IS NULL`)
BradC
A: 

The first step, if there are any indexes, would be to drop them before the operation. They are probably what is causing the slowdown as the update proceeds.

The other option, a little outside-the-box thinking: can you express the update in such a way that you could materialize the column values in a select? If you can, then you could create what amounts to a NEW table using SELECT INTO, which is a minimally logged operation (assuming, in 2005, that you are set to a recovery model of SIMPLE or BULK_LOGGED). This would be pretty fast, and then you can drop the old table, rename this table to the old table name, and recreate any indexes.

select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
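
For the "recreate any indexes" step, a minimal hedged sketch, assuming the original table had a clustered primary key on id (the constraint name below is made up):

-- Recreate the clustered primary key on the renamed table.
-- PK_test_table is an assumed name; use whatever the original constraint was called.
ALTER TABLE dbo.test_table
ADD CONSTRAINT PK_test_table PRIMARY KEY CLUSTERED (id);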
keithwarren7
+5  A: 

Use TOP in the UPDATE statement:

UPDATE TOP (@row_limit) dbo.test_table
   SET bigid = id 
 WHERE bigid IS NULL
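
A brief usage sketch, assuming @row_limit is a variable you declare yourself (the parentheses around it are required):

DECLARE @row_limit int;
SET @row_limit = 500000;

UPDATE TOP (@row_limit) dbo.test_table
   SET bigid = id
 WHERE bigid IS NULL;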
OMG Ponies
A: 

I second the UPDATE TOP (X) statement.

Also, if you're in a loop, I'd suggest adding a WAITFOR delay or a COMMIT between batches, to allow other processes some time to use the table if needed, rather than blocking until all the updates are completed.
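
A minimal sketch of that idea, assuming a 500K batch size and a 5-second pause are acceptable for your workload:

WHILE 1 = 1
BEGIN
    UPDATE TOP (500000) dbo.test_table
       SET bigid = id
     WHERE bigid IS NULL;

    IF @@ROWCOUNT = 0 BREAK;

    WAITFOR DELAY '00:00:05';  -- give other sessions a chance at the table
END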

jerryhung
+2  A: 

I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)

Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.

Log Growth

The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.

If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
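
One way to keep an eye on the log while the batches run (a hedged sketch; the backup path is a placeholder, and db1 is the example name used earlier in the thread):

-- Check log space used per database.
DBCC SQLPERF(LOGSPACE);

-- In FULL or BULK_LOGGED recovery, back up the log between batches so it can be reused.
BACKUP LOG db1 TO DISK = N'X:\backups\db1_log.trn';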

Indexes and Speed

ALL of the 'where bigid is null' answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, of course, just add an index on BIGID, but I'm not convinced that is the right answer.

The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact and do a variation of Jess' idea:

declare @counter bigint
set @counter = 1

while @counter < 2000000000 --or whatever
begin
  update test_table set bigid = id
  where id between @counter and (@counter + 499999) --BETWEEN is inclusive
  set @counter = @counter + 500000
end

This should be extremely fast, because of the existing indexes on ID.

The IS NULL check really wasn't necessary anyway, and neither is the (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.

BradC
@BradC I kicked off a script similar to yours with some modifications, as I have ids from -2B to +2B. My expectation was that the transaction log would not grow much, since I am updating at most 500K rows per batch. The transaction log is set up to grow incrementally by 10G, up to 100G. The log just hit the max and I see the query rolling back. I even put a BEGIN TRAN and COMMIT around the update statement to make sure that the transaction is incremental. Or could that BEGIN TRAN and COMMIT in the loop be causing the transaction log to grow?
Adi Pandit
@BradC looks like the BEGIN TRAN and COMMIT that I had added were causing the transaction log to fill up. Removed those from my script and the procedure has been ripping through the rows crisply overnight without filling up the transaction log :-)
Adi Pandit
@Adi: As long as you have the right syntax (`BEGIN TRAN` and `COMMIT TRAN`), then putting those on the *inner* loop should be the same as my script (SQL uses implicit transactions by default, which is why batching like this works). I once put just `COMMIT` instead of `COMMIT TRAN` and ended up with crazy nested transactions that blew up the log like you said. Just don't put the transaction around the *outside* of the loop.
BradC
Thanks @BradC, accepting your solution as the best answer, as it has the best, most consistent performance and uses the existing index well. I just had to modify the script a bit, since my field has values in the -2B to +2B range, and also adjust the counter values for the last batch to avoid overflow.
Adi Pandit
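
Pulling the follow-up comments together, a hedged sketch of the final variation might look like this (assuming the ranges should be driven from the table's actual MIN and MAX id, and using bigint counters so the last batch cannot overflow):

DECLARE @counter bigint, @maxid bigint;
SELECT @counter = MIN(id), @maxid = MAX(id) FROM dbo.test_table;

WHILE @counter <= @maxid
BEGIN
    UPDATE dbo.test_table
       SET bigid = id
     WHERE id BETWEEN @counter AND @counter + 499999;  -- BETWEEN is inclusive

    SET @counter = @counter + 500000;
END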