I have a very large database (~100 GB), primarily consisting of two tables I want to reduce in size (both of which have approx. 50 million records). I have an archive DB set up on the same server containing these two tables with the same schema. I'm trying to determine the best conceptual approach to removing the rows from the live DB and inserting them into the archive DB. In pseudocode, this is what I'm doing now:

DECLARE @NextIDs TABLE(UniqueID)
DECLARE @TwoYearsAgo = two years before today's date

INSERT INTO @NextIDs
     SELECT TOP 100 UniqueID FROM myLargeTable WHERE myLargeTable.actionDate < @TwoYearsAgo

INSERT INTO myArchiveTable (<fields>)
SELECT <fields>
FROM myLargeTable INNER JOIN @NextIDs ON myLargeTable.UniqueID = @NextIDs.UniqueID

DELETE myLargeTable
FROM myLargeTable INNER JOIN @NextIDs ON myLargeTable.UniqueID = @NextIDs.UniqueID

Right now this takes a horrifically slow 7 minutes to complete 1000 records. I've tested the DELETE and the INSERT separately; each takes approx. 3.5 minutes to complete, so it's not that one is drastically more inefficient than the other. Can anyone point out some optimization ideas here?

Thanks!

This is SQL Server 2000.

Edit: On the large table there is a clustered index on the ActionDate field. There are two other indexes, but neither is referenced in any of the queries. The archive table has no indexes. On my test server, this is the only query hitting the SQL Server, so it should have plenty of processing power.

Code (this loops through the table in batches of 1000 records at a time):

DECLARE @NextIDs TABLE(UniqueID int PRIMARY KEY)
DECLARE @TwoYearsAgo datetime
SELECT @TwoYearsAgo = DATEADD(d, (-2 * 365), GETDATE())

WHILE EXISTS (SELECT TOP 1 UserName FROM [ISAdminDB].[dbo].[UserUnitAudit] WHERE [ActionDateTime] < @TwoYearsAgo)
BEGIN

    BEGIN TRAN

    --get all records to be archived
    INSERT INTO @NextIDs(UniqueID)
    SELECT TOP 1000 UniqueID
    FROM [ISAdminDB].[dbo].[UserUnitAudit]
    WHERE [UserUnitAudit].[ActionDateTime] < @TwoYearsAgo

    --insert into archive table
    INSERT INTO [ISArchive].[dbo].[UserUnitAudit] (<Fields>)
    SELECT <Fields>
    FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
    INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

    --remove from Admin DB
    DELETE [ISAdminDB].[dbo].[UserUnitAudit]
    FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
    INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID

    DELETE FROM @NextIDs

    COMMIT

END
+1  A: 

Are there any indexes on myLargeTable.actionDate and .UniqueID?

Jonas Lincoln
There is a clustered index on the actionDate, but nothing on the UniqueID. There are no indexes on the archive table it is being inserted into.
Kevin
You need an index on myLargeTable.UniqueId for the JOINs. Check the execution plan in the Query Analyzer and you'll probably see table scans.
Jonas Lincoln
Thanks Jonas, I'm going to go talk to the guy that designed the DB to see why we have no index on the uniqueID field. Seems like it would make sense...
Kevin
+1  A: 

Have you tried larger batch sizes than 100?

What is taking the most time? The INSERT, or the DELETE?

Joe
When I crank the batch size up to 1000, both the Insert and the Delete take about three and a half minutes to complete when run separately. The initial insert into @NextIDs takes only a second.
Kevin
+3  A: 

You effectively have three SELECTs that need to run before your insert/delete commands are executed:

for the 1st insert:

SELECT TOP 100 UniqueID FROM myLargeTable WHERE myLargeTable.actionDate < @TwoYearsAgo

for the 2nd insert:

SELECT <fields> FROM myLargeTable INNER JOIN NextIDs 
on myLargeTable.UniqueID = NextIDs.UniqueID

for the delete:

(select *)
FROM MyLargeTable INNER JOIN NextIDs on myLargeTable.UniqueID = NextIDs.UniqueID

I'd try to optimize these first; if they are all quick, then the indexes may be what's slowing down your writes. Some suggestions:

  1. start Profiler and see what's happening with the reads/writes, etc.

  2. check index usage for all three statements.

  3. try running the SELECTs returning only the PK, to see whether the delay is query execution or fetching the data (do you have, e.g., any full-text-indexed fields, TEXT fields, etc.?) -- see the sketch below.
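A rough sketch of step 3, using the table and column names from the question's edit (the STATISTICS options just report per-statement I/O and timings):

-- Compare key-only retrieval against full-row retrieval for the same batch.
DECLARE @TwoYearsAgo datetime
SELECT @TwoYearsAgo = DATEADD(d, -2 * 365, GETDATE())

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Key only: if this is fast, the rows are being found cheaply.
SELECT TOP 1000 UniqueID
FROM [ISAdminDB].[dbo].[UserUnitAudit]
WHERE [ActionDateTime] < @TwoYearsAgo

-- All columns: if this is much slower, the cost is in fetching the data
-- (wide rows, TEXT columns, etc.) rather than in locating it.
SELECT TOP 1000 *
FROM [ISAdminDB].[dbo].[UserUnitAudit]
WHERE [ActionDateTime] < @TwoYearsAgo

SET STATISTICS IO OFF
SET STATISTICS TIME OFF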

davek
+4  A: 

Do you have an index on the source table for the column which you're using to filter the results? In this case, that would be the actionDate.

Also, it can often help to remove all indexes from the destination table before doing massive inserts, but in this case you're only doing hundreds of rows at a time.

You would also probably be better off doing this in larger batches. With one hundred rows at a time, the overhead of the queries ends up dominating the cost/time.

Is there any other activity on the server during this time? Is there any blocking happening?

Hopefully this gives you a starting point.

If you can provide the exact code that you're using (maybe without the column names if there are privacy issues) then maybe someone can spot other ways to optimize.

EDIT: Have you checked the query plan for your block of code? I've run into issues with table variables like this where the query optimizer couldn't figure out that the table variable would be small, so it always did a full table scan on the base table.

In my case it eventually became a moot point, so I'm not sure what the ultimate solution is. You can certainly add a condition on the actionDate to all of your SELECT queries, which would at least minimize the effects of this (see the sketch below).
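For example, a minimal sketch of that extra condition, reusing the table names and variables from the question's code (assumed, not tested against the original schema):

-- Repeat the date filter on the joins so the clustered index on
-- ActionDateTime can narrow the scan even if the optimizer misjudges
-- the @NextIDs table variable.
INSERT INTO [ISArchive].[dbo].[UserUnitAudit] (<Fields>)
SELECT <Fields>
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID
WHERE a.[ActionDateTime] < @TwoYearsAgo

DELETE a
FROM [ISAdminDB].[dbo].[UserUnitAudit] AS a
INNER JOIN @NextIDs AS b ON a.UniqueID = b.UniqueID
WHERE a.[ActionDateTime] < @TwoYearsAgo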

The other option would be to use a normal table to hold the IDs.

Tom H.
See my edit in the OP. I think this answers all your questions.
Kevin
A: 

You might try doing this using the OUTPUT clause:

declare @items table (
  <field list just like source table> )

delete top (100) from source_table
  output deleted.first_field, deleted.second_field, etc.
  into @items
  where <conditions>

insert archive_table (<fields>)
  select <fields> from @items

You also might be able to do this in a single query, by doing OUTPUT ... INTO directly into the archive table (eliminating the need for the table variable).
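A rough sketch of that single-statement form (placeholder names as above; note the OUTPUT clause requires SQL Server 2005 or later, so it isn't available on the SQL Server 2000 instance in the question, and the OUTPUT INTO target table can't have enabled triggers):

-- Delete a batch and route the removed rows straight into the archive
-- table in one statement (hypothetical names; needs SQL Server 2005+).
delete top (100) from source_table
  output deleted.first_field, deleted.second_field
  into archive_table (first_field, second_field)
  where <conditions>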

Ray
+1  A: 

The INSERT and DELETE statements are joining on

[ISAdminDB].[dbo].[UserUnitAudit].UniqueID

If there's no index on this, and you indicate there isn't, you're doing two table scans. That's likely the source of the slowness, because each scan has to read the entire table to find the matching rows rather than seeking on an index.

I think you need to add an index on UniqueID. The cost of maintaining it has got to be less than the cost of repeated table scans, and you can drop it after your archive is done.
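A minimal sketch of that index, using the table name from the question's code (the index name is made up, and UNIQUE assumes UniqueID really is unique; drop that keyword if it isn't):

USE [ISAdminDB]
GO
-- Nonclustered index to support the joins on UniqueID.
CREATE UNIQUE NONCLUSTERED INDEX IX_UserUnitAudit_UniqueID
    ON [dbo].[UserUnitAudit] (UniqueID)
GO
-- Once the archiving run is finished it can be dropped again
-- (SQL Server 2000 syntax):
-- DROP INDEX [UserUnitAudit].[IX_UserUnitAudit_UniqueID]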

DaveE
This actually led me to the solution. Instead of keeping track of the rows I needed to move by UniqueID, which has no index, I simply used the WHERE [ActionDateTime] < @TwoYearsAgo clause on my insert and delete, and voilà, much faster.
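One possible shape of that date-driven version, as a rough sketch only (this is an assumption about the final code, not a copy of it): pick a per-batch cutoff on the clustered ActionDateTime key so the INSERT and DELETE touch exactly the same rows without tracking IDs.

DECLARE @TwoYearsAgo datetime, @BatchCutoff datetime
SELECT @TwoYearsAgo = DATEADD(d, -2 * 365, GETDATE())

WHILE EXISTS (SELECT TOP 1 UniqueID
              FROM [ISAdminDB].[dbo].[UserUnitAudit]
              WHERE [ActionDateTime] < @TwoYearsAgo)
BEGIN
    -- Find the date bounding roughly the next 1000 oldest rows; the
    -- clustered index on ActionDateTime makes this a cheap range scan.
    -- (Batches can exceed 1000 if many rows share the same timestamp.)
    SELECT @BatchCutoff = MAX(ActionDateTime)
    FROM (SELECT TOP 1000 ActionDateTime
          FROM [ISAdminDB].[dbo].[UserUnitAudit]
          WHERE [ActionDateTime] < @TwoYearsAgo
          ORDER BY ActionDateTime) AS oldest

    BEGIN TRAN

    INSERT INTO [ISArchive].[dbo].[UserUnitAudit] (<Fields>)
    SELECT <Fields>
    FROM [ISAdminDB].[dbo].[UserUnitAudit]
    WHERE [ActionDateTime] <= @BatchCutoff

    DELETE FROM [ISAdminDB].[dbo].[UserUnitAudit]
    WHERE [ActionDateTime] <= @BatchCutoff

    COMMIT
END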
Kevin