If I have a table with two important columns,

CREATE TABLE foo (id INT, a INT, b INT, KEY (a), KEY (b));

How can I find all the rows whose (a, b) pair also appears in at least one other row? For example, in this data set

id | a | b
----------
1  | 1 | 2
2  | 5 | 42
3  | 1 | 42
4  | 1 | 2 
5  | 1 | 2
6  | 1 | 42

I want to get back all rows except for id=2 since it is unique in (a,b). Basically, I want to find all offending rows that would stop a

ALTER TABLE foo ADD UNIQUE (a, b);

Something better than an n^2 for loop would be nice since my table has 10M rows.

For bonus points: how do I remove all but one of the rows in each duplicate group? (I don't care which one, as long as one is left.)

+1  A: 
select * from foo where a = b

Or am I missing something?

===

Update for clarity:

select * from 
foo as a
inner join foo as b
on a.a = b.a AND b.a = b.b
and a.id != b.id

++++++++++ After 3rd clarity edit:

select f1.id
FROM foo as f1
INNER JOIN foo as f2
ON f1.a = f2.a AND f1.b=f2.b AND f1.id != f2.id

But I'm shot, so check it yourself.

timdev
updated question since it wasn't clear
Paul Tarjan
A: 

shouldn't this work?

SELECT * FROM foo WHERE a = b

=== edit ===

then how about

SELECT a, b FROM foo GROUP BY a, b HAVING COUNT(*) > 1

=== final re-edit before i give up on this question ===

SELECT foo.* FROM foo, (
   SELECT a, b FROM foo GROUP BY a, b HAVING COUNT(*) > 1
) foo2
WHERE foo.a = foo2.a AND foo.b = foo2.b
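
A quick way to sanity-check the grouped subquery above on the question's sample data is to run it against a throwaway database. This is only a sketch: it assumes SQLite (via Python's sqlite3 module) instead of MySQL, and the helper names make_foo and duplicate_ids are made up for the illustration.

```python
import sqlite3

# Sample rows from the question: (id, a, b)
ROWS = [(1, 1, 2), (2, 5, 42), (3, 1, 42), (4, 1, 2), (5, 1, 2), (6, 1, 42)]

def make_foo(rows=ROWS):
    """Build an in-memory copy of foo (SQLite stand-in for the MySQL table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (id INT, a INT, b INT)")
    conn.executemany("INSERT INTO foo VALUES (?, ?, ?)", rows)
    return conn

def duplicate_ids(conn):
    """Return the ids of rows whose (a, b) pair occurs more than once."""
    sql = """
        SELECT foo.id
        FROM foo
        JOIN (SELECT a, b FROM foo GROUP BY a, b HAVING COUNT(*) > 1) AS dup
          ON foo.a = dup.a AND foo.b = dup.b
        ORDER BY foo.id
    """
    return [row[0] for row in conn.execute(sql)]

print(duplicate_ids(make_foo()))  # [1, 3, 4, 5, 6]
```

Each offending row comes back exactly once, because the derived table has one row per duplicated (a, b) pair, unlike a plain self-join on id <> id.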
Lukman
updated question since it wasn't clear
Paul Tarjan
+1  A: 
SELECT * 
FROM foo first
JOIN foo second
  ON ( first.a = second.a
       AND first.b = second.b ) 
  AND (first.id <> second.id )

Should come up with all the rows where more than one row has the same combination of a and b.

Just hope you have an index on columns a and b.

James Anderson
Paul - not to sound like a complete reputation hooker, but why accept an answer that doesn't actually answer what you stated was your ultimate goal? :)
DVK
I sense this query will return a lot of duplicates. Not a good query ...
Lukman
Just changing the last predicate to (first.id > second.id) will get rid of the duplicates. This was answered before the OP clarified what they really wanted, so I left it simple.
James Anderson
+1  A: 

Could you please clarify what you need to do ultimately? The best solution may depend on that (e.g., do you simply want to delete all duplicate-key rows?)

One way to handle this table, if all you want is one row per unique key (this is from Sybase; I'm not sure whether MySQL supports it), is:

SELECT MIN(id), A, B FROM FOO GROUP BY A, B

Your exact question (although I'm a bit at a loss as to why you'd need all rows except id=2) is:

SELECT F1.*  
FROM FOO F1 , 
     (SELECT A, B FROM FOO GROUP BY A, B HAVING COUNT(*)>1) F2
WHERE F1.A=F2.A and F1.B=F2.B

To delete all the duplicates, you can for example do

DELETE FOO WHERE NOT EXISTS
(SELECT 1 from
    (SELECT MIN(id) 'min_id' FROM FOO GROUP BY A, B) UNIQUE_IDS 
 WHERE id = min_id)

As an alternative, you can do

  SELECT MIN(id) 'id', A, B INTO TEMPDB..NEW_TABLE 
  FROM FOO GROUP BY A, B

  TRUNCATE TABLE FOO
  -- Drop indices on FOO
  INSERT FOO SELECT * FROM TEMPDB..NEW_TABLE
  -- Recreate indices on FOO
DVK
My ultimate goal is to remove all the duplicate rows so I can add the UNIQUE constraint.
Paul Tarjan
@DVK Sadly your query didn't return within 15 minutes on my database so I couldn't evaluate whether it worked. It is a MyISAM table and locked the whole thing up so I didn't want to keep the site down for much longer than 15 mins. The accepted one I can do OFFSET and LIMIT on to chunk the request. I actually combined your solutions to do the temp table using the accepted answer, but I don't have enough rep to edit the answer.
Paul Tarjan
A: 

here's another approach

select * from foo f1 where exists(
  select * from foo f2 where
    f1.id != f2.id and
    f1.a = f2.a and
    f1.b = f2.b )

anyway, even though I find it a bit more readable, if you have such a huge table, you should check the execution plan, subqueries have a bad reputation involving performance...

you should also consider creating the index (without the unique clause, obviously) to speed up the query... for huge operations, sometimes it's better to spend the time creating the index, perform the update and then drop the index... in this case, I guess an index on (a, b) should certainly help a lot...

opensas
A: 

Try this:

With s as (Select a,b from foo group by a,b having Count(1)>1)
Select foo.* from foo,s where foo.a=s.a and foo.b=s.b

This query should show duplicate rows in the table foo.

Himadri
That would work on DB2, SQL Server 2005+, or Oracle 9i+ - but sadly, not MySQL.
OMG Ponies
Yes. I have written this query in sql server 2005.
Himadri
A: 

Your stated goal is to remove all duplicate combinations of (a,b). For that, you can use a multi-table DELETE:

DELETE t1
  FROM foo t1
  JOIN foo t2 USING (a, b)
 WHERE t2.id > t1.id

Before you run it, you can check which rows will be removed with:

SELECT DISTINCT t1.id
  FROM foo t1
  JOIN foo t2 USING (a, b)
 WHERE t2.id > t1.id

Because the WHERE clause is t2.id > t1.id, it removes every row except the one with the highest id in each (a, b) group. In your case, only the rows with id equal to 2, 5 or 6 would remain.
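
To check the "2, 5 or 6" claim without touching the real table, the same keep-the-highest-id rule can be tried on the sample data. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module, and because SQLite has no multi-table DELETE, the self-join is rewritten as a correlated EXISTS with the same t2.id > foo.id condition. The index name foo_a_b is made up.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table, loaded with the sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INT, a INT, b INT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 5, 42), (3, 1, 42), (4, 1, 2), (5, 1, 2), (6, 1, 42)],
)

# Delete every row for which another row with the same (a, b) and a higher id
# exists. This mirrors "JOIN foo t2 USING (a, b) WHERE t2.id > t1.id".
conn.execute("""
    DELETE FROM foo
    WHERE EXISTS (SELECT 1 FROM foo AS t2
                  WHERE t2.a = foo.a AND t2.b = foo.b AND t2.id > foo.id)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM foo ORDER BY id")]
print(remaining)  # [2, 5, 6]

# With the duplicates gone, the unique constraint can be added
# (the SQLite counterpart of ALTER TABLE foo ADD UNIQUE (a, b)).
conn.execute("CREATE UNIQUE INDEX foo_a_b ON foo (a, b)")
```

Flipping the comparison to t2.id < foo.id would keep the lowest id in each group instead.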

Josh Davis
A: 

If the id value doesn't matter at all in the final product, that is, if you could renumber them all and it would be fine, and if id is a serial column, then just "select distinct" on the two columns into a new table, delete all the data from the old table, and then copy the temporary values back in.
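
A minimal sketch of that renumbering approach, assuming SQLite (via Python's sqlite3 module) as a stand-in for MySQL, with an INTEGER PRIMARY KEY playing the role of the serial id column. The staging table name foo_distinct is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY lets SQLite assign fresh ids on insert.
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a INT, b INT)")
conn.executemany(
    "INSERT INTO foo (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 5, 42), (3, 1, 42), (4, 1, 2), (5, 1, 2), (6, 1, 42)],
)

with conn:  # run the whole rebuild in one transaction
    # 1. Copy the distinct (a, b) pairs into a staging table.
    conn.execute("CREATE TEMP TABLE foo_distinct AS SELECT DISTINCT a, b FROM foo")
    # 2. Empty the original table.
    conn.execute("DELETE FROM foo")
    # 3. Copy the pairs back; ids are assigned anew.
    conn.execute("INSERT INTO foo (a, b) SELECT a, b FROM foo_distinct")
    conn.execute("DROP TABLE foo_distinct")

rebuilt = conn.execute("SELECT id, a, b FROM foo ORDER BY id").fetchall()
print(sorted((a, b) for _id, a, b in rebuilt))  # [(1, 2), (1, 42), (5, 42)]
```

The original ids are lost, so this only fits when nothing else references foo.id.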

Kev