ansaurus

Question

Answer 1

A:

How about a two-step approach, assuming you can go offline during a data load:

Mark every item as duplicate.
Select the earliest row from each group, and clear the duplicate flag.

Not elegant, but gets the job done.

Adam Liss 2008-10-28 02:55:22

This can easily be accomplished with a single, rather easy query. No reason to go to these lengths to complicate things.

sliderhouserules 2008-10-31 16:53:48

Answer 2

A:

Here's a funny way to do it:

SET @rowid := 0;

UPDATE mytable
SET duplicate = (rowid = @rowid), 
    rowid = (@rowid:=rowid)
ORDER BY rowid, created_date;

First set a user variable to zero, assuming this is less than any rowid in your table.
Then use the MySQL UPDATE...ORDER BY feature to ensure that the rows are updated in order by rowid, then by created_date.
For each row, if the current rowid is not equal to the user variable @rowid, set duplicate to 0 (false). This will be true only on the first row encountered with a given value for rowid.
Then add a dummy set of rowid to its own value, setting @rowid to that value as a side effect.
As you UPDATE the next row, if it's a duplicate of the previous row, rowid will be equal to the user variable @rowid, and therefore duplicate will be set to 1 (true).

Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.

Bill Karwin 2008-10-28 05:43:26

Answer 3

A:

I don't know the MySQL syntax, but in PLSQL you just do:

UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
  SELECT TOP 1 rowid FROM MyTable t2
  WHERE t2.fingerprint = t1.fingerprint ORDER BY created_date DESC
)

That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.

MySQL version (not tested):

UPDATE t1
  SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
  SELECT rowid FROM MyTable t2
  WHERE t2.fingerprint = t1.fingerprint
  ORDER BY created_date DESC
  LIMIT 1
)

sliderhouserules 2008-10-29 06:22:12

SELECT TOP is a Microsoft SQL Server feature. It is not supported in Oracle or MySQL.

Bill Karwin 2008-10-31 18:21:10

Just looked up the MySQL syntax, it's LIMIT.

sliderhouserules 2008-11-01 01:26:20

Answer 4

A:

Here's another way to do it, using MySQL's multi-table UPDATE syntax:

UPDATE mytable m1
  JOIN mytable m2 ON (m1.rowid = m2.rowid AND m1.created_date < m2.created_date)
SET m2.duplicate = 1;

Bill Karwin 2008-10-31 18:23:59

Doesn't account for duplicate dates...

sliderhouserules 2008-11-01 01:30:31

Oh, yes, you're right. It assumes each date is unique. Ah well.

Bill Karwin 2008-11-01 04:50:46

True, but you could do m1.primary_key < m2.primary_key. I know the OP said he wanted to keep the first record by creation date - wouldn't the primary keys of the table (we're talking an UNIQUE AUTO INCREMENT field here, since it's MySQL) necessarily be in ascending chronological order?

Chris 2009-07-14 20:41:47

@Chris: Yes, *usually* that's true. But you might not be able to assume rows are inserted in chronological order. That is, `m1.primary_key < m2.primary_key` might not guarantee that `m1.created_date < m2.created_date`. YMMV.

Bill Karwin 2009-07-16 23:37:24

Answer 5

+1 A:

MySQL needs to be explicitly told if the data you are grouping by is larger than 1024 bytes (see this link for details). So if your data in the fingerprint column is larger than 1024 bytes you should use set the max_sort_length variable (see this link for details about values allowed, and this link about how to set it) to a larger number so that the group by wont silently use only part of your data for grouping.

Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:

    UPDATE mytable m1
INNER JOIN (SELECT fingerprint
                 , MIN(rowid) AS minrow 
              FROM mytable m2 
          GROUP BY fingerprint) m3 
        ON m1.fingerprint = m3.fingerprint
       SET m1.duplicate = m3.minrow != m1.rowid;

Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.

Dipin 2009-03-20 20:47:41

Answer 6

A:

Untested...

UPDATE TheAnonymousTable
   SET duplicate = TRUE
 WHERE rowid NOT IN
       (SELECT rowid
          FROM (SELECT MIN(created_date) AS created_date, fingerprint
                  FROM TheAnonymousTable
                 GROUP BY fingerprint
               ) AS M,
               TheAnonymousTable AS T
         WHERE M.created_date = T.created_date
           AND M.fingerprint  = T.fingerprint
       );

The logic is that the innermost query returns the earliest created_date for each distinct fingerprint as table alias M. The middle query determines the rowid value for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records for the same fingerprint and timestamp. This gives you the rowid for the earlist record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on all those rows where the rowid is not one of the earliest rows.

Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.

Jonathan Leffler 2009-05-09 01:06:04

Answer 7

A:

If you are only interested in removing duplicates, you can use this technique:

Create a duplicate of your table (structure only -- no data)
Perform an INSERT-SELECT

(This type of group by works in MySQL but may or may not work in others.)

INSERT INTO table_copy (
  rowid,
  fingerprint,
  duplicate,
  contents,
  created_date
)
SELECT 
  rowid,
  fingerprint,
  0,
  contents,
  MAX(created_date)
FROM table_original
GROUP BY fingerprint

Salman A 2009-05-09 08:39:04

ansaurus

tags:

views:

answers:

MySQL -- mark all but 1 matching row

related questions