tags:

views:

690

answers:

7

This is similar to this question, but it seems like some of the answers there aren't quite compatible with MySQL (or I'm not doing it right), and I'm having a heck of a time figuring out the changes I need. Apparently my SQL is rustier than I thought it was. I'm also looking to change a column value rather than delete, but I think at least that part is simple...

I have a table like:

rowid SERIAL
fingerprint TEXT
duplicate BOOLEAN
contents TEXT
created_date DATETIME

I want to set duplicate=true for all but the first (by created_date) of each group by fingerprint. It's easy to mark all of the rows with duplicate fingerprints as dupes. The part I'm getting stuck on is keeping the first.

One of the apps that populates the table does bulk loads of data, with multiple workers loading data from different sources, and the workers' data isn't necessarily partitioned by date, so it's a pain to try to mark these all as they come in (the first one inserted isn't necessarily the first one by date). Also, I already have a bunch of data in there I'll need to clean up either way. So I'd rather just have a relatively efficient query I can run after a bulk load to clean up than try to build it into that app.

Thanks!

A: 

How about a two-step approach, assuming you can go offline during a data load:

  • Mark every item as duplicate.
  • Select the earliest row from each group, and clear the duplicate flag.

Not elegant, but gets the job done.

Adam Liss
This can easily be accomplished with a single, rather easy query. No reason to go to these lengths to complicate things.
sliderhouserules
A: 

Here's a funny way to do it:

SET @rowid := 0;

UPDATE mytable
SET duplicate = (rowid = @rowid), 
    rowid = (@rowid:=rowid)
ORDER BY rowid, created_date;
  • First set a user variable to zero, assuming this is less than any rowid in your table.
  • Then use the MySQL UPDATE...ORDER BY feature to ensure that the rows are updated in order by rowid, then by created_date.
  • For each row, if the current rowid is not equal to the user variable @rowid, set duplicate to 0 (false). This will be true only on the first row encountered with a given value for rowid.
  • Then add a dummy set of rowid to its own value, setting @rowid to that value as a side effect.
  • As you UPDATE the next row, if it's a duplicate of the previous row, rowid will be equal to the user variable @rowid, and therefore duplicate will be set to 1 (true).

Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.

Bill Karwin
A: 

I don't know the MySQL syntax, but in PLSQL you just do:

UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
  SELECT TOP 1 rowid FROM MyTable t2
  WHERE t2.fingerprint = t1.fingerprint ORDER BY created_date DESC
)

That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.


MySQL version (not tested):

UPDATE t1
  SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
  SELECT rowid FROM MyTable t2
  WHERE t2.fingerprint = t1.fingerprint
  ORDER BY created_date DESC
  LIMIT 1
)
sliderhouserules
SELECT TOP is a Microsoft SQL Server feature. It is not supported in Oracle or MySQL.
Bill Karwin
Just looked up the MySQL syntax, it's LIMIT.
sliderhouserules
A: 

Here's another way to do it, using MySQL's multi-table UPDATE syntax:

UPDATE mytable m1
  JOIN mytable m2 ON (m1.rowid = m2.rowid AND m1.created_date < m2.created_date)
SET m2.duplicate = 1;
Bill Karwin
Doesn't account for duplicate dates...
sliderhouserules
Oh, yes, you're right. It assumes each date is unique. Ah well.
Bill Karwin
True, but you could do m1.primary_key < m2.primary_key. I know the OP said he wanted to keep the first record by creation date - wouldn't the primary keys of the table (we're talking an UNIQUE AUTO INCREMENT field here, since it's MySQL) necessarily be in ascending chronological order?
Chris
@Chris: Yes, *usually* that's true. But you might not be able to assume rows are inserted in chronological order. That is, `m1.primary_key < m2.primary_key` might not guarantee that `m1.created_date < m2.created_date`. YMMV.
Bill Karwin
+1  A: 

MySQL needs to be explicitly told if the data you are grouping by is larger than 1024 bytes (see this link for details). So if your data in the fingerprint column is larger than 1024 bytes you should use set the max_sort_length variable (see this link for details about values allowed, and this link about how to set it) to a larger number so that the group by wont silently use only part of your data for grouping.

Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:

    UPDATE mytable m1
INNER JOIN (SELECT fingerprint
                 , MIN(rowid) AS minrow 
              FROM mytable m2 
          GROUP BY fingerprint) m3 
        ON m1.fingerprint = m3.fingerprint
       SET m1.duplicate = m3.minrow != m1.rowid;

Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.

Dipin
A: 

Untested...

UPDATE TheAnonymousTable
   SET duplicate = TRUE
 WHERE rowid NOT IN
       (SELECT rowid
          FROM (SELECT MIN(created_date) AS created_date, fingerprint
                  FROM TheAnonymousTable
                 GROUP BY fingerprint
               ) AS M,
               TheAnonymousTable AS T
         WHERE M.created_date = T.created_date
           AND M.fingerprint  = T.fingerprint
       );

The logic is that the innermost query returns the earliest created_date for each distinct fingerprint as table alias M. The middle query determines the rowid value for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records for the same fingerprint and timestamp. This gives you the rowid for the earlist record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on all those rows where the rowid is not one of the earliest rows.

Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.

Jonathan Leffler
A: 

If you are only interested in removing duplicates, you can use this technique:

  1. Create a duplicate of your table (structure only -- no data)
  2. Perform an INSERT-SELECT

(This type of group by works in MySQL but may or may not work in others.)

INSERT INTO table_copy (
  rowid,
  fingerprint,
  duplicate,
  contents,
  created_date
)
SELECT 
  rowid,
  fingerprint,
  0,
  contents,
  MAX(created_date)
FROM table_original
GROUP BY fingerprint
Salman A