Every night I need to trim back a table to only contain the latest 20,000 records. I could use a subquery:

DELETE FROM table WHERE id NOT IN (SELECT TOP 20000 id FROM table ORDER BY date_added DESC)

But that seems inefficient, especially if we later decide to keep 50,000 records. I'm using SQL 2005, and thought I could use ROW_NUMBER() OVER somehow to do it: order the rows and delete all that have a ROW_NUMBER greater than 20,000. But I couldn't get it to work. Is the subquery my best bet, or is there a better way?
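
For reference, one way to express that ROW_NUMBER() idea in SQL 2005 is a DELETE through a CTE. This is only a sketch, using [table], id, and date_added as placeholder names from the query above:

WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY date_added DESC) AS rn
    FROM [table]
)
DELETE FROM numbered WHERE rn > 20000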

+6  A: 

If it just seems inefficient, I would make sure it actually is inefficient before I start barking up the wrong tree.

Measure the time, CPU usage, disk I/O, etc., to see how well it performs. I think you'll find it performs better than you expect.
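
If you're in Management Studio, SQL Server can report those numbers for you; a minimal sketch:

SET STATISTICS TIME ON
SET STATISTICS IO ON
-- run your DELETE here, then check the Messages tab for elapsed time and logical reads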

Lasse V. Karlsen
You are right, it's only taking 3 seconds to clear the table with around 50,000 records in it. I thought IN() clauses were very inefficient, but maybe that's just when you actually pass in a textual list of IDs. Thanks for the help.
Yes, an IN() clause with 20,000 comma-separated IDs would probably be pretty inefficient. I'll bet it would still execute in something like 10-15 seconds, though.
MusiGenesis
That's if it agreed to even parse a string that long, of course.
MusiGenesis
Check out my alternative solution (http://stackoverflow.com/questions/285614/how-to-delete-all-but-the-latest-20000-records-in-ms-sql-2005#285914) that avoids the nested query. I tested 3 solutions with 60k rows of data, and it turned out to be the fastest according to the execution plan.
Haoest
A: 

Surely this is a prime case for wrapping up into a procedure and using two SQL statements: the first to select the latest ID and subtract 20,000, the second to delete all rows with IDs lower than this (see the sketch below).
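
A minimal sketch of that two-statement approach, assuming a contiguous, ever-increasing id column (your_table is a placeholder name):

DECLARE @cutoff INT
SELECT @cutoff = MAX(id) - 20000 FROM your_table
-- keeps ids @cutoff+1 .. MAX(id), i.e. the newest 20,000 rows when ids are contiguous
DELETE FROM your_table WHERE id <= @cutoff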

However, on the face of it, it sounds like you're going to end up with a lot of fragmentation with this approach, which might be a good argument for creating a new table, inserting the latest 20,000 records into it, dropping the old one, and renaming the new (sketched below). It might even be worthwhile putting the table in a different database and creating a view from your main database to facilitate access. I generally tend to do this with tables used for data load and audit.
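
A sketch of that rebuild-and-rename variant (MyTable and MyTable_new are placeholder names; note that SELECT INTO does not copy indexes or constraints, so those would need recreating before the rename):

SELECT TOP 20000 * INTO MyTable_new FROM MyTable ORDER BY date_added DESC
DROP TABLE MyTable
EXEC sp_rename 'MyTable_new', 'MyTable'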

It's very difficult to tell without knowing your actual data volumes and behavior, but it could well be that your inefficiencies will arise more from fragmentation than from the delete method you use. If you're only collecting a thousand or fewer records a day, then a delete combined with a data-optimization maintenance plan is probably fine; beyond that, I'd be looking at the more drastic approach.

Cruachan
I thought of doing it the way you describe in your first paragraph, but that assumes there are no gaps in the record IDs. I think this will be the case, so that may work.
+1  A: 

Of course, your mileage will vary -- this will depend on how many records you are actually scraping off the bottom of this table -- but here's an alternative.

Side note: since you have a "Date_Added" field, would it be worth simply keeping the datetime of the last run and using that in your WHERE clause to filter the records to be removed? Then, instead of 20,000 records, you'd allow X number of days in the log (sketched just below)... just a thought...
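
A sketch of that date-based variant, assuming a two-week window (DateAdded as in the code below):

DELETE FROM MyTable WHERE DateAdded < DATEADD(day, -14, GETDATE())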


-- Get the records we want to KEEP into a temp.
-- You can classify the keepers however you wish.

select top 20000 * into #myTempTable from MyTable ORDER BY DateAdded DESC

-- Using truncate doesn't trash our log file and uses fewer sys resources...

truncate table MyTable

-- Bring our 'kept' records back into the fold ...
-- This assumes that you are NOT using an identity column -- if you are, you need to
-- list the column names explicitly (no '*') and wrap the insert like so:
-- SET IDENTITY_INSERT MyTable ON
-- insert into MyTable (id, field1, field2) select id, field1, field2 from #myTempTable
-- SET IDENTITY_INSERT MyTable OFF

insert into MyTable select * from #myTempTable

-- be a good citizen.

drop table #myTempTable



Hope it helps --

Borzio
+1  A: 
DECLARE @limit INT
SELECT @limit = MIN(id) FROM
   (SELECT TOP 20000 id FROM your_table ORDER BY id DESC) AS x
DELETE FROM your_table WHERE id < @limit

The point was to avoid the nested query, which may or may not be better optimized (sorry, not a SQL guru).

Haoest
Both this one and the temp-table approach are great ideas that I never would have thought of. I love this site.
A: 

Your question implies that you are trimming to get better daytime performance from the table. Are you getting table scans on the daytime queries? Wouldn't better indexes be the answer? Or are you in a situation where you are stuck with a "crappy schema"?

Or do you have some really strange situation where you indeed need to purge old records? Is 20,000 a hard-and-fast number, or could a datetime work? Then an index on the datetime column would make trimming a bit easier (see the sketch below).
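
For example (a sketch; the index name and two-week window are just illustrative):

CREATE INDEX IX_your_table_date_added ON your_table (date_added)
-- with the index in place, the nightly trim becomes a simple range delete:
DELETE FROM your_table WHERE date_added < DATEADD(week, -2, GETDATE())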

John Dyer
I was originally planning on using a date, like deleting all records older than 2 weeks, but the client specifically wanted to keep an exact number instead. His reasoning was that we can't accidentally run out of space if something goes berserk over a few days.