ansaurus

Question

Fast way to eyeball possible duplicate rows in a table?

Answer 1

A:

select count distinct ....

This should show you whether duplicates exist without having to guess. You can get the column names from your table definition, then copy/paste the non-sequence columns.

No Refunds No Returns 2009-11-25 04:06:46

I did try using count distinct earlier - what kind of black magic do I need to use to get it to work with multiple columns? When I try "SELECT COUNT(DISTINCT Column1, Column2, ...) FROM Table" I get "Incorrect syntax near ','."

Margaret 2009-11-25 04:17:33

@Margaret: COUNT doesn't support 2+ columns
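
One common workaround, sketched here under assumptions rather than as the definitive fix: wrap a SELECT DISTINCT over the columns in a derived table and count its rows. The example uses Python's built-in sqlite3 module purely so it can be run without a SQL Server instance; the table `t` and the columns `sequence`, `episode`, and `col1` are hypothetical.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (sequence INTEGER PRIMARY KEY, episode INTEGER, col1 TEXT)"
)
conn.executemany(
    "INSERT INTO t (episode, col1) VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "b"), (2, "c")],
)

# COUNT(DISTINCT ...) takes a single expression here, so count the rows of a
# SELECT DISTINCT derived table instead.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT episode, col1 FROM t) AS d"
).fetchone()[0]
total_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# A non-zero difference means some rows repeat once sequence is ignored.
surplus_rows = total_rows - distinct_rows
print(surplus_rows)
```

The inner SELECT DISTINCT should carry over to SQL Server unchanged, as long as the unique sequence column is left out of its column list.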

OMG Ponies 2009-11-25 04:18:38

Replace the ... with your columns: select count distinct a, b, c <rest of your criteria here>

No Refunds No Returns 2009-11-25 14:18:01

Answer 2

+1 A:

Use:

  SELECT DISTINCT t.*
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.

50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.

UPDATE: I didn't catch that the sequence column holds a unique value. In that case you have to list explicitly every column you want to see, e.g.:

  SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

There's no notation that lets you use t.* while excluding a single column. Once the sequence column is omitted from the output, the duplicates will become apparent.
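
To make the effect of the sequence column concrete, here is a small sketch that runs both forms of the query against an in-memory SQLite database from Python. It is only an illustration: the table `t`, the column `column1`, and the sample rows are made up, and SQLite stands in for whatever database the question is actually about.

```python
import sqlite3

# Hypothetical three-row table: two rows are identical apart from sequence.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (sequence INTEGER PRIMARY KEY, episode INTEGER, column1 TEXT)"
)
conn.executemany(
    "INSERT INTO t (episode, column1) VALUES (?, ?)",
    [(7, "x"), (7, "x"), (7, "y")],
)

# With t.*, the unique sequence value keeps every row distinct.
with_sequence = conn.execute(
    "SELECT DISTINCT t.* FROM t ORDER BY t.episode"
).fetchall()

# With an explicit column list that omits sequence, identical rows collapse.
without_sequence = conn.execute(
    "SELECT DISTINCT t.episode, t.column1 FROM t ORDER BY t.episode, t.column1"
).fetchall()

print(len(with_sequence), len(without_sequence))
```

The drop in row count between the two queries is the number of rows that are duplicates once sequence is ignored.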

OMG Ponies 2009-11-25 04:17:35

But won't SELECT DISTINCT get confused by the Sequence column, like I was saying?

Margaret 2009-11-25 04:22:49

Get confused? Now I'm confused. `DISTINCT *` is just a synonym for `GROUP BY [all the columns in your query]`

OMG Ponies 2009-11-25 04:36:30

The point was that each row *is* distinct - the Sequence column I mentioned ensures that. This is, at least partially, the source of the issue - the row might be otherwise identical, but the SELECT DISTINCT won't detect that because the (unique) Sequence value is in there.

Margaret 2009-11-25 04:53:23

Answer 3

A:

I think something like this is what you want:

select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode

This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.

Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straightforward (e.g. describe t and iterate over all the fields).

Also, your query should impose an ordering on the self-join, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence. Otherwise every row also matches itself and each real pair shows up twice, once in each order.
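
A rough sketch of why the `a.Sequence < b.Sequence` condition matters, using Python's sqlite3 module so it can be run anywhere. The `Col1` column and the three sample rows are assumptions made for the example; only `Table1`, `Episode`, and `Sequence` come from the join above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (Sequence INTEGER PRIMARY KEY, Episode INTEGER, Col1 TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 (Sequence, Episode, Col1) VALUES (?, ?, ?)",
    [(1, 10, "x"), (2, 10, "x"), (3, 11, "y")],
)

# Without an ordering condition each row pairs with itself, and the one real
# pair (1, 2) also comes back reversed as (2, 1).
unordered = conn.execute(
    "SELECT a.Sequence, b.Sequence FROM Table1 a "
    "INNER JOIN Table1 b ON a.Episode = b.Episode "
    "ORDER BY a.Sequence, b.Sequence"
).fetchall()

# Requiring a.Sequence < b.Sequence keeps each pair exactly once.
ordered = conn.execute(
    "SELECT a.Sequence, b.Sequence FROM Table1 a "
    "INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence "
    "ORDER BY a.Sequence, b.Sequence"
).fetchall()

print(unordered)
print(ordered)
```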

FryGuy 2009-11-25 04:33:45

But the OP *knows* there are duplicate episode values - the question is how to get a list to determine what duplicate to keep or not.

OMG Ponies 2009-11-25 04:37:53

Answer 4

A:

A relatively simple solution that Ponies sparked:

SELECT  t.*
FROM    Table t
    INNER JOIN ( SELECT episode
                 FROM   Table
                 GROUP BY Episode
                 HAVING COUNT(*) > 1
               ) AS x ON t.episode = x.episode

Then copy-paste the results into Excel, and use this formula as the conditional formatting rule for the entire result set:

=AND($C2=$C1,A2<>A1)

Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
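
For anyone who would rather not go through Excel, the same comparison can be approximated in a few lines of Python. This is only a sketch of the formula's logic, not a replacement for the spreadsheet: the sample rows and the `highlighted_cells` helper are hypothetical, and like the formula it assumes the rows are already sorted so that equal episodes sit next to each other.

```python
# Hypothetical result set: (sequence, title, episode). Episode is the third
# column, matching "Column C" in the formula.
rows = [
    (1, "Pilot", 101),
    (2, "Pilot", 101),
    (3, "Pilot (recut)", 101),
    (4, "Finale", 102),
]
EPISODE = 2  # zero-based index of the episode column


def highlighted_cells(rows, episode_index):
    """Return (row, column) positions the formula would highlight."""
    flagged = []
    for r in range(1, len(rows)):
        prev, cur = rows[r - 1], rows[r]
        # $C2=$C1: only compare rows that share the episode of the row above.
        if cur[episode_index] != prev[episode_index]:
            continue
        # A2<>A1: flag each cell that differs from the cell directly above it.
        for c, (above, value) in enumerate(zip(prev, cur)):
            if value != above:
                flagged.append((r, c))
    return flagged


flags = highlighted_cells(rows, EPISODE)
print(flags)
```

The sequence column (index 0) is always flagged because it is unique, so the interesting positions are the ones in other columns.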

Margaret 2009-11-25 04:39:44

Answer 5

+1 A:

Instead of typing out all 50 columns, you could do this:

select column_name from information_schema.columns where table_name = 'your table name'

then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:

select 
  count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
  col1
, col2
, col3
, ...
having count(episode) > 1

This should give you a list of the groups of duplicated rows (but neither the sequence nor the episode numbers themselves). Here's the rub: you will need to join this result set back to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.

Here's where I like to use SQL to generate more SQL. This should get you started:

select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'

You'll plug in those join parameters to this query:

select * from YourTable t1 
inner join (
select 
      count(episode) 'epcount'
    , col1
    , col2
    , col3
    , ...
    from YourTable
    group by
      col1
    , col2
    , col3
    , ...
    having count(episode) > 1
) t2 on 

...plug in all those join parameters here...
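
The generate-SQL-from-metadata idea can also be scripted. The sketch below is one possible take, run against SQLite from Python so that it is self-contained: SQLite has no information_schema, so `PRAGMA table_info` stands in for it, and the columns `col1` and `col2` plus the sample rows are invented for the example. It builds only the GROUP BY query, with episode kept in the grouped columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable "
    "(sequence INTEGER PRIMARY KEY, episode INTEGER, col1 TEXT, col2 TEXT)"
)
conn.executemany(
    "INSERT INTO YourTable (episode, col1, col2) VALUES (?, ?, ?)",
    [(1, "a", "b"), (1, "a", "b"), (2, "c", "d")],
)

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per
# column; it plays the role of information_schema.columns in this sketch.
columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(YourTable)")
    if row[1] != "sequence"
]
column_list = ", ".join(columns)

# Group by every column except sequence and keep groups with more than one row.
query = (
    f"SELECT COUNT(*) AS epcount, {column_list} "
    f"FROM YourTable GROUP BY {column_list} HAVING COUNT(*) > 1"
)
duplicates = conn.execute(query).fetchall()

print(query)
print(duplicates)
```

Building SQL from column names by string formatting is reasonable only because the names come from the database's own metadata; values should still go through bound parameters.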

Jeff Meatball Yang 2009-11-25 06:25:14

Answer 6

A:

Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row.

Query for duplicates of the hash key, which are your "very probably" identical rows.
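
As a minimal sketch of the idea, assuming "sameness" means every column except sequence matches: the `row_hash` helper, the sample rows, and the choice of SHA-256 are all illustrative, and in a real database the hash would more likely be computed in a trigger or computed column than in Python.

```python
import hashlib
from collections import defaultdict

# Hypothetical rows; in practice these would come from the table.
rows = [
    {"sequence": 1, "episode": 1, "col1": "a"},
    {"sequence": 2, "episode": 1, "col1": "a"},
    {"sequence": 3, "episode": 2, "col1": "a"},
]


def row_hash(row, ignore=("sequence",)):
    """Hash every column except the ignored ones, in a stable column order."""
    payload = repr(sorted((k, v) for k, v in row.items() if k not in ignore))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Bucket sequence numbers by hash; buckets with more than one entry are the
# "very probably" identical rows.
groups = defaultdict(list)
for row in rows:
    groups[row_hash(row)].append(row["sequence"])

duplicates = [seqs for seqs in groups.values() if len(seqs) > 1]
print(duplicates)
```

Matching hashes are candidates, not proof, so the flagged rows are still worth comparing column by column before anything is deleted.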

ddyer 2009-11-25 06:31:41
