OK, I have a table with redundant data, and I'm trying to identify all rows that have duplicate sub-rows (for lack of a better word). By sub-rows I mean rows compared on COL1 and COL2 only.

So let's say I have something like this:

 COL1   COL2   COL3
 ---------------------
 aa     111    blah_x
 aa     111    blah_j
 aa     112    blah_m
 ab     111    blah_s
 bb     112    blah_d
 bb     112    blah_d
 cc     112    blah_w
 cc     113    blah_p

I need a SQL query that returns this:

 COL1   COL2   COL3
 ---------------------
 aa     111    blah_x
 aa     111    blah_j
 bb     112    blah_d
 bb     112    blah_d
+4  A: 

Join on yourself like this:

SELECT a.col3, b.col3, a.col1, a.col2 
FROM tablename a, tablename b
WHERE a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 != b.col3

If you're using PostgreSQL, you can use the oid to make it return fewer duplicated results (this needs a table created WITH OIDS, an option that PostgreSQL 12 removed), like this:

SELECT a.col3, b.col3, a.col1, a.col2 
FROM tablename a, tablename b
WHERE a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 != b.col3
  AND a.oid < b.oid
Jerub
+2  A: 

Don't have a database handy to test this, but I think it should work...

select
  *
from
  theTable
where
  col1 in
    (
    select
      col1
    from
      theTable
    group by
      col1||col2
    having
      count(col1||col2) > 1
    )
dacracot
This fails on SQL Server because 'col1' isn't present in the GROUP BY clause. I'm pretty sure this will fail on most other SQL databases.
Craig Trader
+2  A: 

My naive attempt would be

select a.*, b.* from table a, table b where a.col1 = b.col1 and a.col2 = b.col2 and a.col3 != b.col3;

but that would return all the rows twice. I'm not sure how you'd restrict it to just returning them once. Maybe if there was a primary key, you could add "and a.pkey < b.pkey".

Like I said, that's not elegant, and there is probably a better way to do this.

Paul Tomblin
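The primary-key idea in the answer above can be tried out quickly. This is only a minimal sketch using Python's built-in sqlite3 module rather than the asker's database; the table name `t` and the integer key `pkey` are assumptions added for illustration, since the original table has no such column.

```python
import sqlite3

# Sample rows from the question. The integer key "pkey" is an assumption
# added for this sketch; the original table has no such column.
ROWS = [
    ("aa", 111, "blah_x"), ("aa", 111, "blah_j"), ("aa", 112, "blah_m"),
    ("ab", 111, "blah_s"), ("bb", 112, "blah_d"), ("bb", 112, "blah_d"),
    ("cc", 112, "blah_w"), ("cc", 113, "blah_p"),
]

def duplicate_pairs():
    """Self-join on (col1, col2), keeping one ordering of each matching pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (pkey INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER, col3 TEXT)"
    )
    conn.executemany("INSERT INTO t (col1, col2, col3) VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT a.col1, a.col2, a.col3, b.col3
        FROM t a, t b
        WHERE a.col1 = b.col1 AND a.col2 = b.col2 AND a.col3 != b.col3
          AND a.pkey < b.pkey  -- keep only one ordering of each pair
        ORDER BY a.pkey
        """
    ).fetchall()
    conn.close()
    return rows

print(duplicate_pairs())
```

On the sample data this returns a single pair for (aa, 111). The two bb rows are still missing, because their col3 values are identical and `a.col3 != b.col3` filters them out.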
+5  A: 

With the data you have listed, your query is not possible. The data on rows 5 & 6 is not distinct within itself.

Assuming that your table is named 'quux', if you start with something like this:

SELECT a.COL1, a.COL2, a.COL3 
FROM quux a, quux b
WHERE a.COL1 = b.COL1 AND a.COL2 = b.COL2 AND a.COL3 <> b.COL3
ORDER BY a.COL1, a.COL2

You'll end up with this answer:

 COL1   COL2   COL3
 ---------------------
 aa     111    blah_x
 aa     111    blah_j

That's because rows 5 & 6 have the same values for COL3. Any query that returns both rows 5 & 6 will also return duplicates of ALL of the rows in this dataset.

On the other hand, if you have a primary key (ID), then you can use this query instead:

SELECT a.COL1, a.COL2, a.COL3
FROM quux a, quux b
WHERE a.COL1 = b.COL1 AND a.COL2 = b.COL2 AND a.ID <> b.ID
ORDER BY a.COL1, a.COL2

[Edited to simplify the WHERE clause]

And you'll get the results you want:

COL1   COL2   COL3
---------------------
aa     111    blah_x
aa     111    blah_j
bb     112    blah_d
bb     112    blah_d

I just tested this on SQL Server 2000, but you should see the same results on any modern SQL database.

Blorgbeard proved me wrong -- good for him!

Craig Trader
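The primary-key query from the answer above can be reproduced without SQL Server. A rough sketch with Python's sqlite3 module follows; the `ID` column is the assumed primary key the answer talks about, and the extra `a.ID` in the ORDER BY is my addition to make the row order deterministic.

```python
import sqlite3

# Sample rows from the question; ID is auto-assigned 1..8 in insertion order.
ROWS = [
    ("aa", 111, "blah_x"), ("aa", 111, "blah_j"), ("aa", 112, "blah_m"),
    ("ab", 111, "blah_s"), ("bb", 112, "blah_d"), ("bb", 112, "blah_d"),
    ("cc", 112, "blah_w"), ("cc", 113, "blah_p"),
]

def rows_with_duplicates():
    """Return rows whose (COL1, COL2) also appears on a row with another ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quux (ID INTEGER PRIMARY KEY, COL1 TEXT, COL2 INTEGER, COL3 TEXT)"
    )
    conn.executemany("INSERT INTO quux (COL1, COL2, COL3) VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT a.COL1, a.COL2, a.COL3
        FROM quux a, quux b
        WHERE a.COL1 = b.COL1 AND a.COL2 = b.COL2 AND a.ID <> b.ID
        ORDER BY a.COL1, a.COL2, a.ID
        """
    ).fetchall()
    conn.close()
    return rows

for row in rows_with_duplicates():
    print(row)
```

One thing to keep in mind: the sample data has at most two rows per (COL1, COL2) group. With three or more rows in a group, this join would emit each row once per partner, so a DISTINCT or a different approach would be needed.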
+6  A: 

Does this work for you?

select t.* from table t
left join ( select col1, col2, count(*) as count from table group by col1, col2 ) c on t.col1=c.col1 and t.col2=c.col2
where c.count > 1
Blorgbeard
This is a correct answer. I think mine will run a hair faster on a large database, but I'd leave that up to a DBA to decide.
Craig Trader
The left join is not needed: the criterion on the right-side table (c.count > 1) already discards unmatched rows, so an inner join gives the same result.
David B
Looks slower than a solution based on an analytic function to me.
David Aldridge
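The comment about the left join can be checked on the sample data. This is a small sketch with Python's sqlite3 module, not a benchmark; the names `tbl` and `cnt` are stand-ins I chose because `table` and `count` are awkward identifiers, and ordering by SQLite's implicit `rowid` is only there to make the comparison deterministic.

```python
import sqlite3

ROWS = [
    ("aa", 111, "blah_x"), ("aa", 111, "blah_j"), ("aa", 112, "blah_m"),
    ("ab", 111, "blah_s"), ("bb", 112, "blah_d"), ("bb", 112, "blah_d"),
    ("cc", 112, "blah_w"), ("cc", 113, "blah_p"),
]

def run(sql):
    """Load the sample rows into a fresh in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (col1 TEXT, col2 INTEGER, col3 TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows

LEFT_JOIN_SQL = """
SELECT t.col1, t.col2, t.col3
FROM tbl t
LEFT JOIN (SELECT col1, col2, COUNT(*) AS cnt FROM tbl GROUP BY col1, col2) c
  ON t.col1 = c.col1 AND t.col2 = c.col2
WHERE c.cnt > 1
ORDER BY t.rowid
"""
# Same query with a plain inner join instead of the left join.
INNER_JOIN_SQL = LEFT_JOIN_SQL.replace("LEFT JOIN", "JOIN")

print(run(LEFT_JOIN_SQL) == run(INNER_JOIN_SQL))
```

Both variants return the same four rows here, which matches the comment. Whether either is faster than the alternatives is a separate question that this sketch does not address.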
+2  A: 

Something like this should work:

SELECT a.COL1, a.COL2, a.COL3
FROM YourTable a
JOIN YourTable b ON b.COL1 = a.COL1 AND b.COL2 = a.COL2 AND b.COL3 <> a.COL3

In general, the JOIN clause should include every column that you're considering to be part of a "duplicate" (COL1 and COL2 in this case), and at least one column (or as many as it takes) to eliminate a row joining to itself (COL3, in this case).

Jonathan Schuster
+2  A: 

This is pretty similar to the self-join, except it will not have the duplicates.

select COL1,COL2,COL3
from theTable a
where exists (select 'x'
              from theTable b
              where a.col1=b.col1
              and   a.col2=b.col2
              and   a.col3<>b.col3)
order by col1,col2,col3
IK
A: 

select COL1,COL2,COL3
from table
group by COL1,COL2,COL3
having count(*)>1

This does not work. Examine the blah_x row in the question to understand why.
David B
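To see why grouping on all three columns falls short, here is a quick sketch with Python's sqlite3 module (the table name `tbl` is a placeholder of mine). Grouping by COL3 as well puts blah_x and blah_j in separate groups of one row each, so only fully identical rows survive the HAVING clause.

```python
import sqlite3

ROWS = [
    ("aa", 111, "blah_x"), ("aa", 111, "blah_j"), ("aa", 112, "blah_m"),
    ("ab", 111, "blah_s"), ("bb", 112, "blah_d"), ("bb", 112, "blah_d"),
    ("cc", 112, "blah_w"), ("cc", 113, "blah_p"),
]

def group_by_all_columns():
    """Run the GROUP BY COL1, COL2, COL3 query against the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (COL1 TEXT, COL2 INTEGER, COL3 TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT COL1, COL2, COL3
        FROM tbl
        GROUP BY COL1, COL2, COL3
        HAVING COUNT(*) > 1
        """
    ).fetchall()
    conn.close()
    return rows

print(group_by_all_columns())
```

The result is a single (bb, 112, blah_d) row: the aa/111 rows are dropped, and even the bb duplicate is collapsed into one row instead of being listed twice.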
A: 

Forget joins -- use an analytic function:

select col1, col2, col3
from
(
select col1, col2, col3, count(*) over (partition by col1, col2) rows_per_col1_col2
from table
)
where rows_per_col1_col2 > 1
David Aldridge
That only works if your database supports it. SQL Server 2005 does, and presumably Oracle does. SQL Server 2000 does not, nor did MySQL or PostgreSQL at the time of writing (PostgreSQL 8.4 and MySQL 8.0 later added window functions).
Craig Trader
Ah, something new to learn. Should there be another from clause in this?
David B
Heh, yes of course. Thanks.
David Aldridge
+1  A: 

Here is how you find duplicates. Tested in Oracle 10g with your data.

select * from tst where (col1, col2) in (select col1, col2 from tst group by col1, col2 having count(*) > 1)

Kyle Dyer
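The row-value IN query above was reported as tested on Oracle 10g. As a rough cross-check on another engine, the sketch below runs the same shape of query through Python's sqlite3 module; it assumes a SQLite build recent enough to support row values (3.15 or later), and the ORDER BY on the implicit `rowid` is my addition to keep the output order stable.

```python
import sqlite3

ROWS = [
    ("aa", 111, "blah_x"), ("aa", 111, "blah_j"), ("aa", 112, "blah_m"),
    ("ab", 111, "blah_s"), ("bb", 112, "blah_d"), ("bb", 112, "blah_d"),
    ("cc", 112, "blah_w"), ("cc", 113, "blah_p"),
]

def rows_in_duplicate_groups():
    """Return every row whose (col1, col2) occurs more than once in tst."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tst (col1 TEXT, col2 INTEGER, col3 TEXT)")
    conn.executemany("INSERT INTO tst VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT col1, col2, col3
        FROM tst
        WHERE (col1, col2) IN (
            SELECT col1, col2 FROM tst GROUP BY col1, col2 HAVING COUNT(*) > 1
        )
        ORDER BY rowid
        """
    ).fetchall()
    conn.close()
    return rows

for row in rows_in_duplicate_groups():
    print(row)
```

Because the subquery only identifies the duplicated (col1, col2) groups and the outer query reads each row once, this form does not repeat rows when a group has three or more members, and it needs no primary key.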