What is an example of a fast SQL query to find duplicates in datasets with hundreds of thousands of records? I typically use something like:

select afield1, afield2
from afile a
where 1 < (select count(afield1)
           from afile b
           where a.afield1 = b.afield1);

But this is quite slow.

+15  A: 

This is the more direct way:

select afield1, count(afield1)
from afile
group by afield1
having count(afield1) > 1;
Vinko Vrsalovic
Thanks - I'll try that.
mm2010
+4  A: 

You could try:

select afield1, afield2 from afile a
where afield1 in
( select afield1
  from afile
  group by afield1
  having count(*) > 1
);
Tony Andrews
Thanks - I'll try this too.
mm2010
This is actually my preferred way because you can return all columns of the table.
leek
Oddly, 2 people have voted this answer down without commenting on why. I presume this means there is something wrong with it?
Tony Andrews
I'd guess it's slower
Vinko Vrsalovic
Yes, but it shows the information the OP asked for: afield1 and afield2, which may be necessary, for example, to identify which row to keep.
Tony Andrews
+2  A: 

A similar question was asked last week. There are some good answers there.

http://stackoverflow.com/questions/182544/sql-to-find-duplicate-entries-within-a-group

In that question, the OP was interested in all the columns (fields) in the table (file), while rows belonged to the same group if they shared the same key value (afield1).

There are three kinds of answers:

subqueries in the WHERE clause, like some of the other answers here;

an inner join between the table and the groups viewed as a table (my answer);

and analytic queries (something that was new to me). Sketches of the last two approaches appear below.

Walter Mitty
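
For reference, here are minimal sketches of the last two kinds of answers, assuming the same afile table from the question; the alias names dup, t, and cnt are illustrative. The join version matches each row against the set of afield1 values that occur more than once, so all columns of the duplicated rows come back:

-- Inner join between the table and the duplicate groups treated as a table
select a.afield1, a.afield2
from afile a
inner join (
    select afield1
    from afile
    group by afield1
    having count(*) > 1
) dup on dup.afield1 = a.afield1;

The analytic version counts the rows in each afield1 group without collapsing them, then keeps only rows whose group has more than one member. It requires a database with window-function support (e.g. Oracle, PostgreSQL, SQL Server):

-- Count rows per group with a window function, then filter
select afield1, afield2
from (
    select afield1, afield2,
           count(*) over (partition by afield1) as cnt
    from afile
) t
where cnt > 1;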