views: 3163
answers: 11

This has been an adventure. I started with the looping duplicate query from my previous question, but each loop would go over all 17 million records, meaning it would take weeks (just running *SELECT COUNT(*) FROM MyTable* takes my server 4:30 minutes on SQL Server 2005). I gleaned information from this site and from this post:

http://weblogs.sqlteam.com/jeffs/archive/2007/03/28/60146.aspx

And I have arrived at the query below. The question is: is this the correct type of query to run on 17 million records for any kind of performance? If it isn't, what is?

SQL QUERY:

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID IN
    (SELECT RecordID
     FROM tl_acxiomimport.dbo.tblacxiomlistings
     EXCEPT
     SELECT RecordID
     FROM (
         SELECT RecordID,
                RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                             ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
         FROM tl_acxiomimport.dbo.tblacxiomlistings
     ) al
     WHERE Rank = 1)
+1  A: 

Run this in query analyzer:

SET SHOWPLAN_TEXT ON

Then ask query analyzer to run your query. Instead of running the query, SQL Server will generate a query plan and put it in the result set.

Show us the query plan.

David B
How do you do that exactly? I can show the estimated plan, but in SQL Server 2005, how would I display what you are looking for?
RyanKeeter
Step 1: run the command I described ("SET SHOWPLAN_TEXT ON") in a query window. Step 2: run the query you want analyzed in the same window.
David B
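A complete session following David B's steps might look like this (table name taken from the question; note that SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GO separators):

```sql
-- Ask SQL Server to return the plan as text instead of executing statements
SET SHOWPLAN_TEXT ON;
GO

-- This is not executed; its plan appears in the result set instead
SELECT COUNT(*) FROM tl_acxiomimport.dbo.tblacxiomlistings;
GO

-- Turn plan-only mode back off so later statements actually run
SET SHOWPLAN_TEXT OFF;
GO
```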
A: 

This looks fine, but you might consider selecting your data into a temporary table and using that in your delete statement. I've seen huge performance gains from doing this instead of doing it all in one query.
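A minimal sketch of that approach, reusing the ranking and column names from the question: stage the IDs of the duplicate rows in a temp table first, then delete by joining against it.

```sql
-- Stage the RecordIDs of everything except the first-ranked row per group
SELECT RecordID
INTO #dupes
FROM (
    SELECT RecordID,
           RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                        ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS rnk
    FROM tl_acxiomimport.dbo.tblacxiomlistings
) ranked
WHERE rnk > 1;

-- Delete only the staged rows
DELETE t
FROM tl_acxiomimport.dbo.tblacxiomlistings t
JOIN #dupes d ON d.RecordID = t.RecordID;

DROP TABLE #dupes;
```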

Jon
+1  A: 

17 million records is nothing. If it takes 4:30 just to do a SELECT COUNT(*), then there is a serious problem, probably related to either a lack of memory in the server or a really old processor.

For performance, fix the machine. Pump it up to 2GB. RAM is so cheap these days that its cost is far less than your time.

Is the processor or disk thrashing when that query is going? If not, then something is blocking the calls. In that case you might consider putting the database in single user mode for the amount of time it takes to run the cleanup.
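Single-user mode is a one-liner each way; this sketch assumes the database is named tl_acxiomimport (the name is taken from the question's object prefix):

```sql
-- Kick out other connections and take exclusive access
ALTER DATABASE tl_acxiomimport SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- ... run the duplicate cleanup here ...

-- Let everyone back in when the cleanup is done
ALTER DATABASE tl_acxiomimport SET MULTI_USER;
```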

Chris Lively
+1  A: 

So you're deleting all the records that aren't ranked first? It might be worth comparing a join against a TOP 1 subquery (which would also work in SQL 2000, since RANK is 2005 and above only).

Do you need to remove all the duplicates in a single operation? I assume you're performing some sort of housekeeping task; you might be able to do it piecewise.

Basically, create a cursor that loops over all the records (dirty read) and removes the dupes for each. It'll be a lot slower overall, but each operation will be relatively minimal. Then your housekeeping becomes a constant background task rather than a nightly batch.
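A rough sketch of the piecewise idea (the variable types are assumptions; the grouping and ordering columns come from the question): walk the duplicate groups one at a time and delete everything past the best-ranked row in each group.

```sql
DECLARE @BusinessName varchar(255), @lat float, @long float, @Phone varchar(50);

-- Only groups that actually contain duplicates; NOLOCK = dirty read
DECLARE dupe_groups CURSOR FOR
    SELECT BusinessName, latitude, longitude, Phone
    FROM tl_acxiomimport.dbo.tblacxiomlistings WITH (NOLOCK)
    GROUP BY BusinessName, latitude, longitude, Phone
    HAVING COUNT(*) > 1;

OPEN dupe_groups;
FETCH NEXT FROM dupe_groups INTO @BusinessName, @lat, @long, @Phone;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Keep the best-ranked row of this group, delete the rest
    DELETE FROM tl_acxiomimport.dbo.tblacxiomlistings
    WHERE BusinessName = @BusinessName AND latitude = @lat
      AND longitude = @long AND Phone = @Phone
      AND RecordID NOT IN (
          SELECT TOP 1 RecordID
          FROM tl_acxiomimport.dbo.tblacxiomlistings
          WHERE BusinessName = @BusinessName AND latitude = @lat
            AND longitude = @long AND Phone = @Phone
          ORDER BY webaddress DESC, caption1 DESC, caption2 DESC);

    FETCH NEXT FROM dupe_groups INTO @BusinessName, @lat, @long, @Phone;
END
CLOSE dupe_groups;
DEALLOCATE dupe_groups;
```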

Keith
+2  A: 

Something's up with your DB, server, storage, or some combination thereof. 4:30 for a SELECT COUNT(*) seems VERY high.

Run DBCC SHOWCONTIG to see how fragmented your table is; heavy fragmentation could cause a major performance hit on a table that size.

Also, to add to RyanKeeter's comment: run the showplan, and if there are any table scans, create an index on the PK field of that table.
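The fragmentation check, against the table from the question:

```sql
USE tl_acxiomimport;
GO
-- Scan Density close to 100% is good; low values mean heavy fragmentation
DBCC SHOWCONTIG ('dbo.tblacxiomlistings');

-- If it is badly fragmented, rebuilding the indexes defragments the table:
-- DBCC DBREINDEX ('dbo.tblacxiomlistings');
```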

Mike Reedell
Creating an(other) index on the PK field is very very rarely useful.
David B
+1  A: 

The suggestion above to select into a temporary table first is your best bet. You could also use something like:

set rowcount 1000

before running your delete. It will stop after deleting 1000 rows. Then run it again and again until 0 records are deleted.
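Put together with the delete from the question, the trick looks like this (rerun the DELETE until it reports 0 rows affected):

```sql
SET ROWCOUNT 1000;  -- cap every subsequent statement at 1000 rows

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID IN (
    SELECT RecordID
    FROM (
        SELECT RecordID,
               RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                            ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS rnk
        FROM tl_acxiomimport.dbo.tblacxiomlistings
    ) ranked
    WHERE rnk > 1
);

SET ROWCOUNT 0;  -- remember to reset, or later statements stay capped
```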

TrevorD
We used to use the RowCount trick in the past when deleting lots of data. It prevents the transaction from getting too big.
Darrel Miller
+5  A: 

Seeing the QueryPlan would help.

Is this feasible?

SELECT m.*
into #temp
FROM tl_acxiomimport.dbo.tblacxiomlistings m 
inner join (SELECT RecordID, 
                   Rank() over (Partition BY BusinessName, 
                                             latitude,  
                                             longitude,            
                                             Phone  
                                ORDER BY webaddress DESC,  
                                         caption1 DESC,  
                                         caption2 DESC ) AS Rank
              FROM tl_acxiomimport.dbo.tblacxiomlistings
           ) al on (al.RecordID = m.RecordID and al.Rank = 1)

truncate table tl_acxiomimport.dbo.tblacxiomlistings

insert into tl_acxiomimport.dbo.tblacxiomlistings
     select * from #temp
TrickyNixon
TrickyNixon, how would I display the query plan in an output that would be good for you to read? I can display the estimated execution, could you assist?
RyanKeeter
This is the one that actually worked and it did it very fast, thank you very much.
RyanKeeter
Make sure you kept all the records you should have!
TrickyNixon
+1  A: 

If I understand it correctly, your query is the same as:

DELETE allRecords
FROM
    tl_acxiomimport.dbo.tblacxiomlistings allRecords
    LEFT JOIN (
        SELECT RecordID
        FROM (
            SELECT RecordID,
                   RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                                ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
            FROM tl_acxiomimport.dbo.tblacxiomlistings
        ) ranked
        WHERE Rank = 1
    ) myExceptions
    ON allRecords.RecordID = myExceptions.RecordID
WHERE
    myExceptions.RecordID IS NULL

I think that should run faster; I tend to avoid the IN clause in favor of JOINs where possible.

You can safely test the speed and the results by simply calling SELECT * or SELECT COUNT(*) on the FROM part, e.g.:

SELECT *
FROM
    tl_acxiomimport.dbo.tblacxiomlistings allRecords
    LEFT JOIN (
        SELECT RecordID
        FROM (
            SELECT RecordID,
                   RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                                ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
            FROM tl_acxiomimport.dbo.tblacxiomlistings
        ) ranked
        WHERE Rank = 1
    ) myExceptions
    ON allRecords.RecordID = myExceptions.RecordID
WHERE
    myExceptions.RecordID IS NULL

That is another reason why I would prefer the JOIN approach. I hope that helps.

kristof
+2  A: 

Wouldn't it be simpler to do:

DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID IN
    (SELECT RecordID
     FROM (
         SELECT RecordID,
                RANK() OVER (PARTITION BY BusinessName, latitude, longitude, Phone
                             ORDER BY webaddress DESC, caption1 DESC, caption2 DESC) AS Rank
         FROM tl_acxiomimport.dbo.tblacxiomlistings
     ) ranked
     WHERE Rank > 1
    )
David Aldridge
Yes, but then we wouldn't get to use the really cool EXCEPT clause that I just learned about....spoilsport :-)
Darrel Miller
Sorry. New stuff is always cool.
David Aldridge
A: 

Other than using TRUNCATE as suggested, I've had the best luck with this template for deleting lots of rows from a table. I don't remember offhand, but I think the transaction helped keep the log file from growing (it may have been for another reason, though). And I usually switch the database's recovery model to simple before doing something like this:

SET ROWCOUNT 5000
WHILE 1 = 1
BEGIN
    BEGIN TRAN
    DELETE FROM ??? WHERE ???
    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT
        BREAK
    END
    COMMIT
END
SET ROWCOUNT 0
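The recovery-model switch mentioned above is a one-liner each way (the database name tl_acxiomimport is assumed from the question):

```sql
-- Minimize transaction logging for the bulk cleanup
ALTER DATABASE tl_acxiomimport SET RECOVERY SIMPLE;

-- ... run the batched delete loop here ...

-- Switch back (and take a full backup to restart the log chain)
ALTER DATABASE tl_acxiomimport SET RECOVERY FULL;
```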
Chris
A: 

Remember, when doing a large delete, it is best to have a good backup first. (I also usually copy the deleted records to another table, in case I need to recover them right away.)

HLGEM