We have a number of databases which store tens to hundreds of gigabytes of data in one of their tables, which contains image data. The problem is that a lot of these databases were created improperly: the primary key isn't actually a primary key. They were created with a unique index on a nullable column, and some of them have an int as the primary key instead of a bigint.
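For illustration only, here is roughly what that flawed shape looks like next to the intended one; the table and column names below are made up, and the blob is assumed to live in an `image` column:

    -- Hypothetical names, illustrating the problem described above.
    -- Flawed: the "key" is a nullable int with only a unique index on it.
    CREATE TABLE ImageData_Old (
        DocumentID int NULL,
        ImageBytes image NOT NULL
    );
    CREATE UNIQUE INDEX IX_ImageData_Old_DocumentID ON ImageData_Old (DocumentID);

    -- Intended: a real, non-nullable bigint primary key.
    CREATE TABLE ImageData_New (
        DocumentID bigint NOT NULL
            CONSTRAINT PK_ImageData_New PRIMARY KEY NONCLUSTERED,
        ImageBytes image NOT NULL
    );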

So we've been slowly going through and fixing these databases. They run on SQL Server 2000 through SQL Server 2008, although most of the ones with primary key issues are on SQL Server 2000. The problem is, we don't want to lock down the database for a whole day while it converts the table. We've gone through several strategies:

  1. Tell SQL Server to directly change the column type. This locks the table until it's complete, and after leaving it overnight in many cases, it still wasn't finished.

  2. Insert all the images into a new table in one go. This was more easily interrupted, but the whole table basically gets written to the log file in the process.

  3. Insert 100 rows at a time where the rows don't exist in the target table. The upside is that users can keep using the database while this is going on (with a big performance hit), the copy can be stopped and restarted at any point, and it prevents the 100 GB+ log files. This is what we're currently doing, but finding the top 100 rows that don't exist gets really slow as the target table grows. UPDATE STATISTICS and DBCC INDEXDEFRAG help considerably (both are sketched after the query below), but in the most recent attempt we got to the point where even 100 images at a time would just sit there, not responding.

    INSERT INTO %s  
      SELECT TOP 100 Source.*  
      FROM %s AS Source WITH (NOLOCK)  
      LEFT OUTER JOIN %s AS Target WITH (NOLOCK) ON Source.DocumentID = Target.DocumentID  
      WHERE Target.DocumentID IS NULL  
      ORDER BY Source.DocumentID  
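
For reference, the maintenance commands mentioned in option 3 look roughly like this; the database, table, and index names below are placeholders:

    -- Placeholder names (MyDb, ImageData_New, IX_ImageData_New_DocumentID).
    UPDATE STATISTICS ImageData_New;                                       -- refresh stats on the target table
    DBCC INDEXDEFRAG (MyDb, ImageData_New, IX_ImageData_New_DocumentID);   -- defragment the index the anti-join relies on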
    

So the question is, is there an option which can copy bulk data in an efficient and resumable manner? It doesn't have to be 100% accurate; we can always go back and fix any discrepancies at the end, as long as it does 99% of the work.

+3  A: 

The join is the problem. Don't do that. Just loop through your current table in batches of some reasonable size, using the current clustered index. Something like:

    DECLARE @idrange int;
    SET @idrange = 1;

    WHILE @idrange < 10000000
    BEGIN

      -- copy one 1,000-row range per pass
      INSERT INTO Destination
      SELECT *
      FROM Source
      WHERE DocumentID BETWEEN @idrange AND @idrange + 999
      ORDER BY Source.DocumentID

      SET @idrange = @idrange + 1000
    END

Note that for best speed, remove all indexes (including the clustered index) from the destination table, then add the indexes once all rows have been inserted.
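For example (a sketch only, with made-up constraint/index names; the real DDL depends on what the destination table actually has):

    -- Made-up names. Before the copy, drop the destination's indexes:
    ALTER TABLE Destination DROP CONSTRAINT PK_Destination;   -- the key, if it's a constraint
    -- DROP INDEX Destination.IX_SomeSecondaryIndex;          -- plus any secondary indexes

    -- ...run the batched WHILE loop above...

    -- After all rows are copied, re-create them:
    ALTER TABLE Destination
        ADD CONSTRAINT PK_Destination PRIMARY KEY NONCLUSTERED (DocumentID);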

EDIT: changed the range interval to prevent an overlap (since BETWEEN includes the endpoints)

One final clarification: the overall point of my sample script is that you simply want to walk through your current records in some reasonable order, and put them all into the new table in batches. There's no reason to continue checking the destination table each time, since you should already know what you've put there, and what is still left. Most of the time, it makes sense to use the clustered index (if there is one), since this means it can walk through the physical order of the table without doing bookmark lookups. If the table has no cluster, then just use whatever makes the most sense (your PK, probably).

BradC
The primary key on this table is non-clustered. Converting to a clustered index would be almost as painful as recreating the table, because it has to physically re-arrange all the data.
Bryce Wagner
Doing it by ranges is definitely more efficient than the join method. I'll have to see how it performs in practice, but what I was hoping was that there was some way to execute the inserts in a non-atomic manner, where if it gets stopped partway through, it would leave what's already done, and avoid the extra writing to the log file.
Bryce Wagner
@Bryce Wagner - So if it's not clustered on the PK, then do an equivalent range or grouping on whatever it *is* clustered on. (Please tell me it does have a cluster, and that it's not just a HEAP)
BradC
My WHILE loop *does* do atomic inserts. If you want to be really explicit about it, just put a `BEGIN TRAN` and `COMMIT TRAN` around the insert. If it does get interrupted, you'll have to query the table to find out for sure where it left off, so you can restart it in the right place.
BradC
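A minimal sketch of that restart logic, reusing the Source/Destination names from the script above; the resume point is simply the highest DocumentID already copied:

    -- Resume from the highest DocumentID already in the destination.
    DECLARE @idrange int;
    SELECT @idrange = ISNULL(MAX(DocumentID), 0) + 1 FROM Destination;

    WHILE @idrange < 10000000
    BEGIN
      BEGIN TRAN;

      INSERT INTO Destination
      SELECT *
      FROM Source
      WHERE DocumentID BETWEEN @idrange AND @idrange + 999;

      COMMIT TRAN;                      -- each 1,000-row batch commits on its own

      SET @idrange = @idrange + 1000;
    END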
It is not clustered, because outside of this situation, we're only ever pulling a single row from this table at a time. And I think you misread my second comment; go back and read it again: I was hoping for a solution that DOESN'T involve all the transaction machinery. I want speed over accuracy (and then do cleanup at the end). But your answer is far better than what I was doing before.
Bryce Wagner
Small side note: I'm a big fan of a clustered index on every table, (almost) all the time. Most of the time this is also the PK. Otherwise, you essentially have the records in *arbitrary order* on the physical storage. So any lookup, even using the PK, has to do a "bookmark lookup" to get to the physical record. In this case, with no clustered index, then you would definitely want to use the PK as your "range" field.
BradC
My point regarding the transactions *is* one of speed and efficiency. You want to physically commit each batch insert, so that if a problem occurs, it doesn't roll back the entire operation. My original script already does this implicitly (since SQL auto-commits each statement by default); adding the `TRAN` commands just makes it more obvious what is going on. Whatever you do, you *don't* want a big explicit transaction around the entire script, otherwise it *will* roll back all records if a problem occurs.
BradC
@Bryce Sorry, one more :). If your DB is in SIMPLE recovery mode, then my sample script shouldn't grow the log file much at all (because it commits every 1000 records). If it is in FULL recovery mode, you may need to increase the frequency of your transaction log backups while you perform this operation, so the log can re-use the space from all the committed batches and the file doesn't grow out of control. But no, there's no way to prevent SQL from using its log for this operation. That's just the way SQL works.
BradC
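For reference, a sketch of those checks; MyDb and the backup path are placeholders:

    -- Placeholders: MyDb and the file path.
    SELECT DATABASEPROPERTYEX('MyDb', 'Recovery');       -- returns SIMPLE, FULL, or BULK_LOGGED

    -- Under FULL recovery, back the log up more frequently while the copy runs,
    -- so the space used by committed batches can be reused:
    BACKUP LOG MyDb TO DISK = 'D:\Backups\MyDb_log.trn';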
