
Suppose I have a table with many columns. If two rows match exactly in a couple of columns, then those rows are duplicates.

ID | title | link | size | author

For example, if link and size are the same for 2 or more rows, then those rows are duplicates. How do I get those duplicates into a list and process them?

+6  A: 

This will return all records that have duplicates:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
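
To get those rows into a Python list for further processing, here is a minimal sketch that runs the query above through a DB-API driver (assuming MySQLdb and made-up connection parameters):

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    SELECT theTable.*
    FROM theTable
    INNER JOIN (
      SELECT link, size
      FROM theTable
      GROUP BY link, size
      HAVING count(ID) > 1
    ) dups ON theTable.link = dups.link AND theTable.size = dups.size
""")
duplicates = list(cur.fetchall())  # each item is an (ID, title, link, size, author) tuple
for row in duplicates:
    print(row)                     # replace with whatever processing you need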

I like the subquery because I can do things like select all but the first or last, which makes it very easy to turn into a DELETE query.

Example: select all duplicate records EXCEPT the one with the max ID:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size, max(ID) as maxID
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link 
          AND theTable.size = dups.size 
          AND theTable.ID <> dups.maxID
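
A sketch of the delete version of that query, using MySQL's multi-table DELETE syntax (my assumption; try it on a copy of the data first) and the same kind of DB-API connection as above:

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
# Multi-table DELETE: removes every duplicate row except the one with the max ID.
cur.execute("""
    DELETE theTable
    FROM theTable
    INNER JOIN (
      SELECT link, size, max(ID) as maxID
      FROM theTable
      GROUP BY link, size
      HAVING count(ID) > 1
    ) dups ON theTable.link = dups.link
          AND theTable.size = dups.size
          AND theTable.ID <> dups.maxID
""")
conn.commit()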
Lance Rushing
+1  A: 

This assumes that none of id, link, or size can be NULL, and that id is the primary key. It gives you the ids of the duplicate rows. Beware that the same id can appear in the results several times if there are three or more rows with identical link and size values.

select a.id, b.id 
from tbl a, tbl b  
where a.id < b.id   
  and a.link = b.link  
  and a.size = b.size
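
To process those pairs in Python, one possible sketch (again assuming MySQLdb and made-up connection parameters) collapses them into groups keyed by the smallest id of each duplicate set:

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way
from collections import defaultdict

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    select a.id, b.id
    from tbl a, tbl b
    where a.id < b.id
      and a.link = b.link
      and a.size = b.size
""")
pairs = cur.fetchall()

# Rows sharing (link, size) are all paired with each other, so the smallest id
# of each duplicate set never appears in the second column.
larger = {b for a, b in pairs}
groups = defaultdict(set)
for a, b in pairs:
    if a not in larger:       # a is the smallest id of its duplicate set
        groups[a].add(b)
# groups now maps each smallest id to the set of ids of its duplicates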
Juha Syrjälä
A: 

If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and it can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator over a SELECT * FROM thetable ORDER BY link, size and process it with itertools.groupby, using as key an operator.itemgetter for those two fields; this will present natural groupings of each bunch of one or more rows with identical values for the fields in question.
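
A minimal sketch of that approach, assuming MySQLdb (any DB-API driver works) and the column order ID, title, link, size, author:

import itertools
import operator

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("SELECT * FROM thetable ORDER BY link, size")

key = operator.itemgetter(2, 3)          # positions of link and size in each row
for (link, size), rows in itertools.groupby(cur, key=key):
    rows = list(rows)
    if len(rows) > 1:                    # a bunch of duplicates
        keep, dups = rows[0], rows[1:]   # e.g. keep the first, process the rest
        # ... update scores, delete dups, or whatever processing is needed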

I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!

Alex Martelli
I want to find the rows that are "duplicates" based on certain attributes. Then I want to calculate the "important" row, and remove/update the score of the "duplicate" row! WOW, you wrote Python Cookbook!!!??? I have that ON MY DESK RIGHT NOW.
TIMEX
+1  A: 

After you remove the duplicates from the MySQL table, you can add a unique index to the table so no more duplicates can be inserted:

create unique index theTable_index on theTable (link,size);
unutbu