Hello everyone,

I have a very large table (more than 300 million records) that will need to be cleaned up. Roughly 80% of it will need to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table, but no external relationships.

The best solution I have come up with so far is to put the database into "simple" recovery mode, copy all the records I want to keep into a temporary table, truncate the original table, set IDENTITY_INSERT on, and copy the data back from the temp table.

It works, but it still takes several hours to complete. Is there a faster way to do this?
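
For reference, this is roughly the sequence (simplified; the table, column, and predicate names are made up):

    -- Copy the rows to keep into a holding table
    SELECT *
    INTO dbo.BigTable_Keep
    FROM dbo.BigTable
    WHERE SomeDate >= '2010-01-01';      -- whatever the "keep" predicate is

    -- Minimally logged; also resets the identity seed
    TRUNCATE TABLE dbo.BigTable;

    -- Copy the data back, preserving the original identity values
    SET IDENTITY_INSERT dbo.BigTable ON;

    INSERT INTO dbo.BigTable (Id, SomeDate, OtherColumn)
    SELECT Id, SomeDate, OtherColumn
    FROM dbo.BigTable_Keep;

    SET IDENTITY_INSERT dbo.BigTable OFF;

    DROP TABLE dbo.BigTable_Keep;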

A: 

I am not sure what the structure of your data is. When does a row become eligible for deletion? If it is purely an ID-based or date-based thing, then you can create a new table for each day, insert your new data into the new tables, and when it comes to cleaning simply drop the required tables. Then for any selects, construct a view over all the tables. Just an idea.

EDIT (in response to comments): If you are maintaining a view over all the tables, then no, searching won't be complicated at all. The complex part is coding the dropping of the tables and the recreating of the view.
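
A rough sketch of the idea, with made-up table names (in practice the daily tables and the view redefinition would presumably be generated by a scheduled job):

    -- One physical table per day (hypothetical naming scheme)
    CREATE TABLE dbo.Events_20100614 (Id INT NOT NULL, CreatedAt DATETIME NOT NULL, Payload VARCHAR(100));
    CREATE TABLE dbo.Events_20100615 (Id INT NOT NULL, CreatedAt DATETIME NOT NULL, Payload VARCHAR(100));
    GO

    -- All selects go through a view that unions the current set of daily tables
    CREATE VIEW dbo.Events AS
        SELECT Id, CreatedAt, Payload FROM dbo.Events_20100614
        UNION ALL
        SELECT Id, CreatedAt, Payload FROM dbo.Events_20100615;
    GO

    -- "Cleaning up" a day: redefine the view without the old table, then drop it
    ALTER VIEW dbo.Events AS
        SELECT Id, CreatedAt, Payload FROM dbo.Events_20100615;
    GO
    DROP TABLE dbo.Events_20100614;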

I am assuming that you don't want your data to be locked down too much during deletes. Why not chunk the delete operations? Create a stored procedure that deletes the data in chunks, 50,000 rows at a time. This should make sure that SQL Server keeps row locks instead of a table lock. Use

WAITFOR DELAY 'x'

in your WHILE loop so that you can give other queries a bit of breathing room. Your problem is the age-old computer science trade-off: space vs. time.
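
Something along these lines (just a sketch; the table name, predicate, and delay are placeholders):

    -- Delete in 50,000-row batches, pausing between batches
    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000)
        FROM dbo.BigTable
        WHERE SomeDate < '2009-01-01';   -- the "to be deleted" predicate

        IF @@ROWCOUNT = 0
            BREAK;

        WAITFOR DELAY '00:00:05';        -- let other queries breathe
    END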

uriDium
The data already exists and it's already consumed by an application. Besides, it would make searching the remaining data rather complex.
Stephane
@uriDium - That will still be logged though. The OP is trying to avoid the overhead from logging to make it quicker and hasn't mentioned anything about minimising impact on concurrent transactions. I presume there might not be any.
Martin Smith
+1  A: 

As per the comments, my suggestion would be to simply dispense with the copy-back step and promote the table containing the records to be kept to become the new main table by renaming it.

It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.

The clustered index should be created before the nonclustered indexes.
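
A minimal sketch of that swap, assuming a hypothetical dbo.BigTable with an identity key and one nonclustered index (all names made up):

    -- Copy the rows to keep (minimally logged under the simple or bulk-logged recovery model)
    SELECT *
    INTO dbo.BigTable_New
    FROM dbo.BigTable
    WHERE SomeDate >= '2010-01-01';

    -- Recreate the indexes/statistics on the new table, clustered index first
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Id ON dbo.BigTable_New (Id);
    CREATE NONCLUSTERED INDEX IX_BigTable_SomeDate ON dbo.BigTable_New (SomeDate);

    -- Swap the new table in
    DROP TABLE dbo.BigTable;
    EXEC sp_rename 'dbo.BigTable_New', 'BigTable';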

A couple of points I'm not sure about though.

  1. Whether it would be quicker to insert into a heap and then create the clustered index afterwards. (I guess not, if the insert can be done in clustered-index order.)
  2. Whether the original table should be truncated before being dropped. (I guess yes.)
Martin Smith
I guess I'll be creating the clustered index first, but it shouldn't really matter: it's a surrogate key anyway, so doing the initial select should have created the table in the correct order already. I'll make sure by adding an ORDER BY clause, though. Why would it be necessary to truncate the table before dropping it? Neither operation is log-backed, as far as I know. I'll try this and report the result.
Stephane
@Stephane I simply wasn't sure on that point. I'd be interested if you confirm either way.
Martin Smith
+1  A: 

@uriDium -- Chunking using batches of 50,000 will still escalate to a table lock, unless you have disabled lock escalation via ALTER TABLE (SQL Server 2008) or various other locking tricks.

etliens
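
For reference, the SQL Server 2008 option mentioned above looks like this (table name made up):

    ALTER TABLE dbo.BigTable SET (LOCK_ESCALATION = DISABLE);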