views: 498 | answers: 15

I have a table with about 10 fields that stores GPS info for customers. Over time, as we have added more customers, the table has grown to about 14 million rows. As the GPS data comes in, a service constantly inserts rows into the table. 90% of the data is not relevant, i.e. the customer does not care where the vehicle was 3 months ago, but the most recent data is used to generate tracking reports. My goal is to write a SQL script to purge the data that is older than a month.

Here is my problem: I can NOT use TRUNCATE TABLE, as I would lose everything. Yesterday I wrote a DELETE statement with a WHERE clause. When I ran it on a test system, it locked up my table and the simulated GPS inserts were intermittently failing. Also, my transaction log grew to over 6GB as it attempted to log each delete.

My first thought was to delete the data a little at a time, starting with the oldest, but I was wondering if there is a better way.

+2  A: 

Can you copy recent data to a new table, truncate the table, then copy it back?

Of course, then you're going to need to worry about doing that again in 6 months or a year.

Michael Todd
+3  A: 

Better is to create a temporary table and insert only the data you want to keep. Then truncate your original table and copy the data back from that table.

Oracle syntax (SQL Server is similar)

create table keep as select * from source where data_is_good = 1;
truncate table source;
insert into source select * from keep;

You'll need to disable foreign keys, if there are any on the source table.

In Oracle, index names must be unique across the entire schema, not just per table. In SQL Server, where index names only need to be unique per table, you can further optimize this: create indexes of the same name on both tables, then simply rename "keep" to "source" instead of copying the data back, as sketched below.
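
A minimal sketch of that rename variant in SQL Server, reusing the answer's table names and WHERE condition (purely illustrative):

SELECT * INTO keep FROM source WHERE data_is_good = 1   -- copies rows and column definitions
-- recreate the indexes on keep, reusing the same names they carry on source
DROP TABLE source
EXEC sp_rename 'keep', 'source'

Note that, as the comment below points out, rows inserted into source after the copy starts would be lost unless inserts are paused.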

Matt
If data is continually being inserted, he can't use the rename, since he could lose any data inserted between the copy into "keep" and the rename. He could put it in a transaction, but that might cause timeouts.
Tom H.
+9  A: 

My 2 cents:

If you are using SQL 2005 and above, you can consider partitioning your table on the date field, so the table doesn't get locked when deleting old records.
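
A minimal sketch of date-based partitioning in SQL Server 2005+ (the function, scheme, and boundary dates are hypothetical):

CREATE PARTITION FUNCTION pfGpsByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('2009-03-01', '2009-04-01', '2009-05-01')

CREATE PARTITION SCHEME psGpsByMonth
    AS PARTITION pfGpsByMonth ALL TO ([PRIMARY])

-- the table (or its clustered index) is then created ON psGpsByMonth(GPSDeviceTimeStamp)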

Maybe, if you are in a position to make DBA decisions, you can temporarily change your recovery model to Simple, so the log won't grow too fast; it will still grow, but it won't retain as much detail.
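
If you go that route, a minimal sketch (the database name is hypothetical; Simple recovery breaks the log backup chain, so switch back and take a full backup afterwards):

ALTER DATABASE GpsTracking SET RECOVERY SIMPLE
-- ... run the batched purge here ...
ALTER DATABASE GpsTracking SET RECOVERY FULL
BACKUP DATABASE GpsTracking TO DISK = 'D:\Backups\GpsTracking.bak'   -- restart the backup chain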

Jhonny D. Cano -Leftware-
Thanks, I will check this out.
Ron Skufca
+1  A: 

I would probably do it in batches as you have already come up with. Another option would be to insert the important data into another table, truncate the GPS table, then reinsert the important data. You would have a small window where you would be missing the recent historical data. How small that window is would depend on how much data you needed to reinsert. Also, you would need to be careful if the table uses autoincrementing numbers or other defaults so that you use the original values.

Once you have the table cleaned up, a regular cleaning job should be scheduled. You might also want to look into partitioning depending on your RDBMS.

Tom H.
I have thought about that, and after looking at the other responses I see some merit in this approach as well. Plus, it is only GPS position data, and missing a couple of minutes might not be an issue, but for a moving vehicle anything longer might be. My customers are police, and with my luck, the minute I stop the GPS they will get into a high-speed pursuit which they will want to play back, and I will have no data for them!
Ron Skufca
+3  A: 

If you're using SQL Server 2005 or 2008, sliding window partitioning is the perfect solution for this - instant archiving or purging without any perceptible locking. Have a look here for further information.
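
For illustration, the purge side of a sliding window looks roughly like this (staging table and partition function names are hypothetical; the staging table must match the source schema and sit on the same filegroup as the partition being switched out):

ALTER TABLE tblGPSVehicleInfoLog SWITCH PARTITION 1 TO tblGPSVehicleInfoLog_Staging
TRUNCATE TABLE tblGPSVehicleInfoLog_Staging   -- metadata-only, effectively instant
ALTER PARTITION FUNCTION pfGpsByMonth() MERGE RANGE ('2009-03-01')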

Aaron Alton
Partitioned tables work only on Enterprise Edition :(
dario-g
+1  A: 

I assume you can't take down the production system (or queue up the GPS results for insertion after the purge is complete).

I'd go with your inclination of deleting a fraction of it at a time (perhaps 10%) depending on the performance you find on your test system.

Is your table indexed? That might help, but the indexing process may have similar effects on the system as doing the one great purge.

CodeSlave
Yep, I can't take the system down, and yes, the table is indexed properly.
Ron Skufca
+2  A: 

Try this

WHILE EXISTS ( SELECT * FROM table WHERE (condition for deleting))
BEGIN
    SET ROWCOUNT 1000
    DELETE FROM table WHERE (condition for deleting)
    SET ROWCOUNT 0
END

This will delete the rows in groups of 1000

Eric
Did this help you?
Eric
I would add a WAITFOR DELAY '00:00:04' to avoid the locks
Jhonny D. Cano -Leftware-
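
For reference, a sketch of where that delay could sit in Eric's loop (the condition and interval are only illustrative):

WHILE EXISTS ( SELECT * FROM table WHERE (condition for deleting))
BEGIN
    SET ROWCOUNT 1000
    DELETE FROM table WHERE (condition for deleting)
    SET ROWCOUNT 0
    WAITFOR DELAY '00:00:04'   -- give waiting inserts a chance to get through between batches
END
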
Thanks, this looks like what I am looking for. I will test later today.
Ron Skufca
Glad I could help.
Eric
I am testing on a database with approx 14 million rows. I added the WAITFOR DELAY '00:00:05'
Ron Skufca
Has it worked or is it still running?
Eric
It was taking forever so I posted my solution below. I used ideas from most of the input. It did require me to take the GPS service down for about a half hour.
Ron Skufca
I'm currently investigating a similar problem, and tried this solution. It helps reducing locking issues, but I still have problems with the transaction log growing too much while records are being deleted. @Ron Skufca: Were you able to solve the transaction log issue in your final solution?
ckarras
+2  A: 

I would do a manual delete by day/month (whatever is the largest unit you can get away with). Once you've done that first purge, write a stored proc to kick off every day that deletes the oldest data you don't need.

DELETE FROM TABLENAME 
WHERE datediff(day, tableDateTime, getdate()) > 90
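
A minimal sketch of wrapping that in a daily proc (names are hypothetical; applying DATEDIFF to the column prevents index use, so the comparison is rewritten to be sargable):

CREATE PROCEDURE dbo.PurgeOldGpsData
AS
BEGIN
    DELETE FROM TABLENAME
    WHERE tableDateTime < DATEADD(day, -90, GETDATE())   -- same 90-day cutoff, but an index on tableDateTime can be used
END

The proc can then be scheduled as a daily SQL Server Agent job.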

Personally, I hate doing stuff to production datasets where one missed key results in some really bad things happening.

Dayton Brown
I agree. This is why I am testing this on a non-production server, so if everything works I can just copy the SQL.
Ron Skufca
A: 

Keep in mind that most databases lock the neighboring records in an index during a transaction, so keeping your operations short will be helpful. I'm assuming that your insertions are failing on lock wait timeouts, so delete your data in small, bursty transactions. I'd suggest a single-threaded Perl script that loops through it in 1,000-row increments, oldest first. I hope your primary key (and hopefully your clustered index, in case they somehow ended up being two different things) can be correlated to time, as that would be the best thing to delete by.

PseudoSQL:
maxPrimId = select max(primId) from table where timestamp < 3_months_ago
delete from table where primId < maxPrimId limit 1000

Now, here's the really fun part: All these deletions MAY make your indexes a mess and require that they be rebuilt to keep the machine from getting slow. In that case, you'll either have to swap in an up-to-date slave, or just suffer some downtime. Make sure you test for this possible case on your test machine.
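
A possible T-SQL rendering of that pseudoSQL (table and column names are hypothetical), run repeatedly until no rows are affected:

DECLARE @maxPrimId INT
SELECT @maxPrimId = MAX(primId) FROM dbo.BigTable WHERE RowDate < DATEADD(month, -3, GETDATE())

DELETE TOP (1000) FROM dbo.BigTable WHERE primId < @maxPrimId   -- repeat until @@ROWCOUNT = 0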

Autocracy
A: 

If you are using Oracle, I would set up date-based partitioning on your tables and indexes. Then you delete the data by dropping the partition... the data will magically go away with the partition.

This is an easy step - and doesn't clog up your redo logs etc.

There's a basic intro to all this here

time4tea
+2  A: 

Welcome to Data Warehousing. You need to split your data into two parts.

  • The actual application, with current data only.

  • The history.

You need to write a little "ETL" job to move data from current to history and then delete the moved rows from the current table.

You need to run this periodically. Daily, weekly, monthly, quarterly -- technically it doesn't matter. What matters is what use the history has and who uses it.
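
A minimal sketch of that job, assuming a history table named tblGPSVehicleInfoLogHistory with the same structure (the one-month cutoff comes from the question):

DECLARE @cutoff datetime
SET @cutoff = DATEADD(month, -1, GETDATE())

INSERT INTO tblGPSVehicleInfoLogHistory
SELECT * FROM tblGPSVehicleInfoLog WHERE UpdateDate < @cutoff

DELETE FROM tblGPSVehicleInfoLog WHERE UpdateDate < @cutoff   -- same cutoff, so nothing is deleted that wasn't copied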

S.Lott
A: 

Does the delete statement use any of the indexes on the table? Often a huge performance improvement can be obtained by either modifying the statement to use an existing index, or adding an index on the table that improves the performance of the query the delete statement does.
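
For example (a sketch, using the date column from the question's schema), an index that covers the purge condition:

CREATE NONCLUSTERED INDEX IX_tblGPSVehicleInfoLog_UpdateDate
    ON tblGPSVehicleInfoLog (UpdateDate)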

Also, as others mentioned, the deletes should be done in multiple chunks instead of one huge statement. This prevents the table from being locked too long, and keeps other processes from timing out while waiting for the delete to finish.

Adam Porad
A: 

Performance is pretty fast when dropping a table, even a very large one. So here is what I would do. Script out your table, complete with indexes, from Management Studio. Edit the script and run it to create a copy of your table; call it table2. Do an INSERT ... SELECT to park the data you want to retain in the new table2. Rename the old table, say to tableOld. Rename table2 to the original name. Wait. If no one screams at you, drop tableOld.

There is some risk. 1) Check whether there are triggers or constraints defined on the original table; they may not get included in the script generated by Management Studio. 2) If the original table has identity fields, you may have to turn on IDENTITY_INSERT before inserting into the new table.
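
A minimal sketch of the swap, using the question's table name and the answer's placeholder names (the retention condition is illustrative):

INSERT INTO table2 SELECT * FROM tblGPSVehicleInfoLog WHERE UpdateDate > DATEADD(month, -1, GETDATE())
EXEC sp_rename 'tblGPSVehicleInfoLog', 'tableOld'
EXEC sp_rename 'table2', 'tblGPSVehicleInfoLog'
-- verify, then: DROP TABLE tableOld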

BishopOfNairobi
A: 

I came up with the following T-SQL script, which keeps an arbitrary range of recent data.

IF EXISTS(SELECT name FROM sys.tables WHERE name = 'tmp_xxx_tblGPSVehicleInfoLog')
BEGIN
    PRINT 'Dropping temp table tmp_xxx_tblGPSVehicleInfoLog'
    DROP TABLE tmp_xxx_tblGPSVehicleInfoLog
END
GO

PRINT 'Creating temp table tmp_xxx_tblGPSVehicleInfoLog'
CREATE TABLE [dbo].[tmp_xxx_tblGPSVehicleInfoLog](
    [GPSVehicleInfoLogId] [uniqueidentifier] NOT NULL,
    [GPSVehicleInfoId] [uniqueidentifier] NULL,
    [Longitude] [float] NULL,
    [Latitude] [float] NULL,
    [GroundSpeed] [float] NULL,
    [Altitude] [float] NULL,
    [Heading] [float] NULL,
    [GPSDeviceTimeStamp] [datetime] NULL,
    [Milliseconds] [float] NULL,
    [DistanceNext] [float] NULL,
    [UpdateDate] [datetime] NULL,
    [Stopped] [nvarchar](1) NULL,
    [StopTime] [datetime] NULL,
    [StartTime] [datetime] NULL,
    [TimeStopped] [nvarchar](100) NULL
) ON [PRIMARY]
GO

PRINT 'Inserting data from tblGPSVehicleInfoLog to tmp_xxx_tblGPSVehicleInfoLog'
INSERT INTO tmp_xxx_tblGPSVehicleInfoLog
SELECT * FROM tblGPSVehicleInfoLog 
WHERE tblGPSVehicleInfoLog.UpdateDate BETWEEN '03/30/2009 23:59:59' AND '05/19/2009 00:00:00'
GO

PRINT 'Truncating table tblGPSVehicleInfoLog'
TRUNCATE TABLE tblGPSVehicleInfoLog
GO

PRINT 'Inserting data from tmp_xxx_tblGPSVehicleInfoLog to tblGPSVehicleInfoLog'
INSERT INTO tblGPSVehicleInfoLog 
SELECT * FROM tmp_xxx_tblGPSVehicleInfoLog 
GO
Ron Skufca
A: 

To keep the transaction log from growing out of control, batch the deletes in the following way:

DECLARE @i INT
SET @i = 1
SET ROWCOUNT 1000

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
        SELECT @i = @@ROWCOUNT   -- capture before COMMIT, which resets @@ROWCOUNT
    COMMIT
END
SET ROWCOUNT 0

And here is a version using the preferred TOP syntax for SQL 2005 and 2008:

DECLARE @i INT
SET @i = 1

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE TOP (1000) FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
        SELECT @i = @@ROWCOUNT   -- capture before COMMIT, which resets @@ROWCOUNT
    COMMIT
END
BradC