I have a table with ~17 million rows in it that I need to de-duplicate. Under normal circumstances this wouldn't be a challenge, but this isn't a normal circumstance. Normally 'duplicate rows' means two or more rows containing exactly the same values in every column. In this case 'duplicate rows' means two or more rows that have exactly the same values, but are also within 20 seconds of each other. I wrote a script that is still running after 19.5 hours, which isn't acceptable, but I'm not sure how else to do it. Here's the script:

begin
    create table ##dupes (ID int)

    declare curOriginals cursor for
        select ID, AssociatedEntityID, AssociatedEntityType, [Timestamp] from tblTable

    declare @ID          int
    declare @AssocEntity int
    declare @AssocType   int
    declare @Timestamp   datetime
    declare @Count       int

    open curOriginals
    fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    while @@FETCH_STATUS = 0
    begin
        select @Count = COUNT(*) from tblTable
        where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
            and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
            and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
            and ID <> @ID
        if (@Count > 0)
        begin
            insert into ##dupes (ID)
                (select ID from tblHBMLog
                 where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
                     and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
                     and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
                     and ID <> @ID)
            print @ID
        end
        delete from tblHBMLog where ID = @ID or ID in (select ID from ##dupes)
        fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    end

    close curOriginals
    deallocate curOriginals

    select * from ##dupes
    drop table ##dupes
end

Any help would be greatly appreciated.

A: 

If you have enough memory and storage, it may be faster this way:

  1. Create a new table with the same structure
  2. Copy all the data into it with SELECT DISTINCT
  3. Clear the original table (you may need to drop some constraints first)
  4. Copy the data back into the original table

Instead of steps 3 and 4 you can drop the original table and rename the temp table.
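The exact-duplicate step of this approach can be sketched outside SQL too; a minimal Python illustration with hypothetical tuples standing in for rows (note this only removes rows that match exactly, so the 20-second rule would still need separate handling):

```python
# Drop exact duplicates while preserving the first occurrence,
# analogous to copying the table through SELECT DISTINCT.
# The column layout here is made up for illustration.
def distinct_rows(rows):
    seen = set()
    result = []
    for row in rows:
        if row not in seen:   # rows must be hashable, e.g. tuples
            seen.add(row)
            result.append(row)
    return result

rows = [
    (1, 100, 2),  # (AssociatedEntityID, AssociatedEntityType, value)
    (1, 100, 2),  # exact duplicate
    (1, 101, 2),
]
print(distinct_rows(rows))  # → [(1, 100, 2), (1, 101, 2)]
```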

Andrey Shvydky
A: 

Putting the time differentiator aside, the first thing I would do is knock this list down to a much smaller subset of potential duplicates. For example, if you have 17 million rows, but only, say, 10 million have every field matching except the time, then you have just chopped off a large portion of your processing.

To do this I'd just whip up a query to dump the unique IDs of the potential duplicates into a temp table, then use this as an inner join against your cursor (again, this would be a first step).

In looking at the cursor, I see a lot of relatively heavy function calls which would explain your slowdowns. There's also a lot of data activity, and I would not be surprised if you were being crushed by an I/O bottleneck.

One thing you could do then is, rather than use the cursor, dump the work into your programming language of choice. Assuming we've already limited all of our fields except for the timestamp down to a manageable set, grab each subset in turn (i.e. rows that match on the remaining fields), since any dupes would necessarily have all of their other fields matching. Then just snuff out the duplicates you find in these smaller atomic subsets.

So assuming you have 10 million potentials, and each time range has about 20 records or so that need to be worked through with the date logic, you're down to a much smaller number of database calls and some quick code - and from experience, knocking out the datetime comparisons etc. outside of SQL is generally a lot faster.

Bottom line is to figure out ways to, as quickly as possible, partition your data down into manageable subsets.
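The grab-each-subset-and-sweep idea can be sketched in Python (hypothetical data; in practice each subset would come from a query keyed on the matching fields). This keeps the earliest row of each 20-second window and flags the rest; adjust if chained duplicates should collapse further:

```python
from datetime import datetime, timedelta

# One subset: rows whose non-timestamp fields already match,
# given as (ID, timestamp) pairs. Keep the earliest row of each
# 20-second window and flag the rest as duplicates.
def find_dupes(subset, window_secs=20):
    dupes = []
    anchor = None
    for row_id, ts in sorted(subset, key=lambda r: r[1]):
        if anchor is not None and ts <= anchor + timedelta(seconds=window_secs):
            dupes.append(row_id)   # within 20s of the kept row
        else:
            anchor = ts            # this row survives and starts a new window
    return dupes

subset = [
    (1, datetime(2009, 1, 1, 12, 0, 0)),
    (2, datetime(2009, 1, 1, 12, 0, 19)),  # within 20s of row 1
    (3, datetime(2009, 1, 1, 12, 0, 39)),  # outside row 1's window
]
print(find_dupes(subset))  # → [2]
```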

Hope that helps!

-Bob

Bob Palmer
+1  A: 

A quick tweak that should gain some speed would be to replace the nasty COUNT section with an EXISTS check:

IF EXISTS(SELECT 1 FROM tblTable WHERE AssociatedEntityID = @AssocEntity
    AND AssociatedEntityType = @AssocType AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
    AND [Timestamp] <= DATEADD(ss, 20, @Timestamp)
    AND ID <> @ID) -- if there are any matching rows...
BEGIN
    DELETE FROM tblHBMLog
    OUTPUT deleted.ID INTO ##dupes
    WHERE AssociatedEntityID = @AssocEntity AND AssociatedEntityType = @AssocType
        AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
        AND [Timestamp] <= DATEADD(ss, 20, @Timestamp) -- I think this is supposed to be within the block, not outside it
END

I've also now replaced the double references of ##dupes with the OUTPUT clause which will mean you're not scanning a growing ##dupes every time you delete a row. As far as the deletion goes, as you're deleting the ID and its matches in one go you don't need such an elaborate deletion clause. You've already checked that there are entries that need removing, and you seem to want to remove all the entries including the original.

Once you answer Paul's question, we can take a look at completely removing the cursor.

CodeByMoonlight
A: 

In answer to Paul's question:

What happens when you have three entries, a, b, c. a = 00 secs, b = 19 secs, c = 39 secs. Are these all considered to be the same time? (a is within 20 secs of b, b is within 20 secs of c)

If the other comparisons are equal (AssociatedEntityid and AssociatedEntityType) then yes, they are considered the same thing, otherwise no.


I would add this to the original question, except that I used a different account to post it and now can't remember my password. It was a very old account and I didn't realize that I had connected to the site with it.

I have been working with some of the answers you have given me and there is one problem: you're using only one key column (AssociatedEntityID) when there are two (AssociatedEntityID and AssociatedEntityType). Your suggestions would work great for a single key column.

What I have done so far is:

Step 1: Determine which AssociatedEntityID and AssociatedEntityType pairs have duplicates and insert them into a temp table:

create table ##stage1 (ID   int, AssociatedEntityID     int, AssociatedEntityType int, [Timestamp] datetime)

insert into ##stage1 (AssociatedEntityID, AssociatedEntityType)
    (select AssociatedEntityID, AssociatedEntityType from tblHBMLog group by AssociatedEntityID, AssociatedEntityType having COUNT(*) > 1)

Step 2: Retrieve the ID of the earliest occurring row with a given AssociatedEntityID and AssociatedEntityType pair:

declare curStage1 cursor for
    select AssociatedEntityID, AssociatedEntityType from ##stage1

open curStage1
fetch next from curStage1 into @AssocEntity, @AssocType
while @@FETCH_STATUS = 0
begin
    select top 1 @ID = ID, @Timestamp = [Timestamp] from tblHBMLog
        where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
        order by [Timestamp] asc
    update ##stage1 set ID = @ID, [Timestamp] = @Timestamp
        where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
    -- advance the cursor; without this fetch the loop never terminates
    fetch next from curStage1 into @AssocEntity, @AssocType
end

close curStage1
deallocate curStage1

And this is where things slow down again. Now, granted, the result set has been pared down from ~17 million to just under 400,000, but it is still taking quite a long time to run through.

I guess another question I should ask is this: if I continue to write this in SQL, will it just have to take a long time? Should I write it in C# instead? Or am I just not seeing the forest for the trees of this solution?


Well, after much stomping of feet and gnashing of teeth, I have come up with a solution. It's just a simple, quick and dirty C# command-line app, but it's faster than the SQL script and it does the job.

I thank you all for your help; in the end the SQL script was just taking too much time to execute, and C# is much better suited for this kind of looping.
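For anyone following along, the shape of that command-line approach can be sketched in Python (hypothetical data; the real app would read the rows from and issue the deletes against tblHBMLog). Unlike some of the suggestions above, this groups on both key columns:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# rows: (ID, AssociatedEntityID, AssociatedEntityType, Timestamp)
def ids_to_delete(rows, window_secs=20):
    # group on BOTH key columns, as the question requires
    groups = defaultdict(list)
    for row_id, entity_id, entity_type, ts in rows:
        groups[(entity_id, entity_type)].append((row_id, ts))

    doomed = []
    for pairs in groups.values():
        anchor = None
        for row_id, ts in sorted(pairs, key=lambda p: p[1]):
            if anchor is not None and ts <= anchor + timedelta(seconds=window_secs):
                doomed.append(row_id)  # duplicate within the window
            else:
                anchor = ts            # earliest row of a new window survives
    return sorted(doomed)

rows = [
    (1, 7, 1, datetime(2009, 1, 1, 12, 0, 0)),
    (2, 7, 1, datetime(2009, 1, 1, 12, 0, 5)),   # dup of row 1
    (3, 7, 2, datetime(2009, 1, 1, 12, 0, 5)),   # different type, kept
    (4, 7, 1, datetime(2009, 1, 1, 12, 0, 30)),  # outside the window, kept
]
print(ids_to_delete(rows))  # → [2]
```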

B0fh
you should **update** your original question - not answer it yourself with some additional info...
marc_s
+1  A: 

Basically, I agree with Bob. First of all, you have way too much work being repeated 17 million times in your code. Second, you could crop your set down to just the absolute duplicates. Third, it would be nicer if you had enough memory (which you should) and tried to solve this in your programming language of choice.

Anyway, for the sake of a concrete answer, and because your query might still be running, I will try to give a working script which I think does what you want.

First of all, you should have an index. I would recommend an index on the AssociatedEntityID field. If you already have one, but your table has been populated with lots of data after you created the index, then drop it and recreate it in order to have fresh statistics.

Then see the script below, which does the following:

  1. dumps all duplicates into ##dupes, ignoring the 20-sec rule
  2. sorts them (by AssociatedEntityID, Timestamp) and starts the simplest straightforward loop it can
  3. checks for a duplicate AssociatedEntityID and a timestamp inside the 20-sec interval; if both hold, it inserts the ID into the ##dupes_to_be_deleted table

The assumption is that if you have a set of more than two duplicates in sequence, the script eliminates every duplicate within 20 secs of the first one. Then, from the next one remaining, if any, it resets and goes for another 20 secs, and so on...

Here is the script; it may be useful to you, though I did not have time to test it:

CREATE TABLE ##dupes
             (
                          ID                 INT ,
                          AssociatedEntityID INT ,
                          [Timestamp]        DATETIME
             )
CREATE TABLE ##dupes_to_be_deleted
             (
                          ID INT
             )

-- collect all dupes, ignoring for now the rule of 20 secs
INSERT
INTO   ##dupes
SELECT ID                 ,
       AssociatedEntityID ,
       [Timestamp]
FROM   tblTable
WHERE  AssociatedEntityID IN
       ( SELECT  AssociatedEntityID
       FROM     tblTable
       GROUP BY AssociatedEntityID
       HAVING   COUNT(*) > 1
       )

-- then sort and loop on all of them
-- using a cursor
DECLARE c CURSOR FOR
SELECT   ID                 ,
         AssociatedEntityID ,
         [Timestamp]
FROM     ##dupes
ORDER BY AssociatedEntityID,
         [Timestamp]

-- declarations
DECLARE @id                     INT,
        @AssociatedEntityID     INT,
        @ts                     DATETIME,
        @old_AssociatedEntityID INT,
        @old_ts                 DATETIME

-- initialisation
SELECT @old_AssociatedEntityID = 0,
       @old_ts                 = '1900-01-01'

-- start loop
OPEN c
FETCH NEXT
FROM  c
INTO  @id                ,
      @AssociatedEntityID,
      @ts
WHILE @@fetch_status = 0
BEGIN
        -- check for dupe AssociatedEntityID
        IF @AssociatedEntityID = @old_AssociatedEntityID
        BEGIN
                -- check for time interval
                IF @ts <= DATEADD(ss, 20, @old_ts )
                BEGIN
                        -- yes! it is a duplicate
                        -- store it in ##dupes_to_be_deleted
                        INSERT
                        INTO   ##dupes_to_be_deleted
                               (
                                      id
                               )
                               VALUES
                               (
                                      @id
                               )
                END
                ELSE
                BEGIN
                        -- IS THIS OK?:
                        -- put last timestamp for comparison
                        -- with the next timestamp
                        -- only if the previous one is not going to be deleted.
                        -- this way we delete all duplicates
                        -- 20 secs away from the first of the set of duplicates
                        -- and the next one remaining will be a duplicate
                        -- but after the 20 secs interval.
                        -- and so on ...
                        SET @old_ts = @ts
                END
        END
        ELSE
        BEGIN
                -- new AssociatedEntityID: reset the timestamp anchor,
                -- otherwise @old_ts would carry over from the previous entity
                SET @old_ts = @ts
        END

        -- prepare vars for next iteration
        SELECT @old_AssociatedEntityID = @AssociatedEntityID
        FETCH NEXT
        FROM  c
        INTO  @id                ,
              @AssociatedEntityID,
              @ts
END
CLOSE c
DEALLOCATE c


-- now you have all the ids that are duplicates and in the 20 sec interval of the first duplicate in the ##dupes_to_be_deleted
DELETE
FROM       <wherever> -- replace <wherever> with tblHBMLog?
WHERE  id IN
       ( SELECT id
       FROM    ##dupes_to_be_deleted
       )
DROP TABLE ##dupes_to_be_deleted
DROP TABLE ##dupes

You could give it a try and leave it for a couple of hours. Hope it helps.

daskd