
Suppose I have a table with many columns. If two rows match exactly in a couple of columns, then those rows are duplicates.

ID | title | link | size | author

For example, if link and size are the same for 2 or more rows, then those rows are duplicates. How do I get those duplicates into a list and process them?

+6  A: 

This will return all records that have duplicates:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
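
To get those rows into a Python list for further processing, here is a minimal sketch that runs the query above through a DB-API driver (assuming MySQLdb and made-up connection parameters):

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    SELECT theTable.*
    FROM theTable
    INNER JOIN (
      SELECT link, size
      FROM theTable
      GROUP BY link, size
      HAVING count(ID) > 1
    ) dups ON theTable.link = dups.link AND theTable.size = dups.size
""")
duplicates = list(cur.fetchall())  # each item is an (ID, title, link, size, author) tuple
for row in duplicates:
    print(row)                     # replace with whatever processing you need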

I like the subquery because I can do things like select all but the first or last, which makes it very easy to turn into a DELETE query.

Example: select all duplicate records EXCEPT the one with the max ID:

SELECT theTable.*
FROM theTable
INNER JOIN (
  SELECT link, size, max(ID) as maxID
  FROM theTable 
  GROUP BY link, size
  HAVING count(ID) > 1
) dups ON theTable.link = dups.link 
          AND theTable.size = dups.size 
          AND theTable.ID <> dups.maxID
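
A sketch of the delete version of that query, using MySQL's multi-table DELETE syntax (my assumption; try it on a copy of the data first) and the same kind of DB-API connection as above:

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
# Multi-table DELETE: removes every duplicate row except the one with the max ID.
cur.execute("""
    DELETE theTable
    FROM theTable
    INNER JOIN (
      SELECT link, size, max(ID) as maxID
      FROM theTable
      GROUP BY link, size
      HAVING count(ID) > 1
    ) dups ON theTable.link = dups.link
          AND theTable.size = dups.size
          AND theTable.ID <> dups.maxID
""")
conn.commit()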
Lance Rushing
+1  A: 

This assumes that none of id, link, or size can be NULL, and that id is the primary key. It gives you the ids of the duplicate rows. Beware that the same id can appear in the results several times if there are three or more rows with identical link and size values.

select a.id, b.id 
from tbl a, tbl b  
where a.id < b.id   
  and a.link = b.link  
  and a.size = b.size
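
To process those pairs in Python, one possible sketch (again assuming MySQLdb and made-up connection parameters) collapses them into groups keyed by the smallest id of each duplicate set:

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way
from collections import defaultdict

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    select a.id, b.id
    from tbl a, tbl b
    where a.id < b.id
      and a.link = b.link
      and a.size = b.size
""")
pairs = cur.fetchall()

# Rows sharing (link, size) are all paired with each other, so the smallest id
# of each duplicate set never appears in the second column.
larger = {b for a, b in pairs}
groups = defaultdict(set)
for a, b in pairs:
    if a not in larger:       # a is the smallest id of its duplicate set
        groups[a].add(b)
# groups now maps each smallest id to the set of ids of its duplicates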
Juha Syrjälä
A: 

If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and it can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator over a SELECT * FROM thetable ORDER BY link, size and process it with itertools.groupby, using as key an operator.itemgetter for those two fields; this will present natural groupings of each bunch of one or more rows with identical values for the fields in question.
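
A minimal sketch of that approach, assuming MySQLdb (any DB-API driver works) and the column order ID, title, link, size, author:

import itertools
import operator

import MySQLdb  # assumption: MySQLdb driver; any DB-API module works the same way

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("SELECT * FROM thetable ORDER BY link, size")

key = operator.itemgetter(2, 3)          # positions of link and size in each row
for (link, size), rows in itertools.groupby(cur, key=key):
    rows = list(rows)
    if len(rows) > 1:                    # a bunch of duplicates
        keep, dups = rows[0], rows[1:]   # e.g. keep the first, process the rest
        # ... update scores, delete dups, or whatever processing is needed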

I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!

Alex Martelli
I want to find the rows that are "duplicates" based on certain attributes. Then I want to calculate the "important" row, and remove/update the score of the "duplicate" row! WOW, you wrote Python Cookbook!!!??? I have that ON MY DESK RIGHT NOW.
TIMEX
+1  A: 

After you remove the duplicates from the MySQL table, you can add a unique index to the table so no more duplicates can be inserted:

create unique index theTable_index on theTable (link,size);
unutbu