Similar: http://stackoverflow.com/questions/91784

I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.

I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.

I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.

My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:

SELECT a.Episode,
       CASE WHEN a.Value1<>b.Value1 
            THEN a.Value1 + ',' + b.Value1 
            ELSE '' END AS Value1,
       CASE WHEN a.Value2<>b.Value2 
            THEN a.Value2 + ',' + b.Value2 
            ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
      OR a.Value2<>b.Value2

(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)

Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't have to be rock-solid if it will only be used once, but the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just use SELECT DISTINCT, since Sequence is unique and otherwise identical rows will still show up as different.

Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.

A: 
select count distinct ....

Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.

No Refunds No Returns
I did try using count distinct earlier - what kind of black magic do I need to use to get it to work with multiple columns? When I try "SELECT COUNT(DISTINCT Column1, Column2, ...) FROM Table" I get "Incorrect syntax near ','."
Margaret
@Margaret: COUNT doesn't support 2+ columns
OMG Ponies
Replace the ... with your columns: select count distinct a, b, c <rest of your criteria here>
No Refunds No Returns
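
On the COUNT(DISTINCT ...) error raised in the comments: one common workaround is to count the rows of a derived table built with SELECT DISTINCT. This is only a minimal sketch, assuming SQL Server (the error message and the + concatenation in the question suggest T-SQL). The Demo rows and the Value1/Value2 columns are made up for illustration and stand in for the real non-Sequence columns.

```sql
-- Hypothetical sample data; Sequence is unique, Episode repeats.
WITH Demo (Sequence, Episode, Value1, Value2) AS (
    SELECT 1, 100, 'A', 'X' UNION ALL
    SELECT 2, 100, 'A', 'Y' UNION ALL
    SELECT 3, 200, 'B', 'Z' UNION ALL
    SELECT 4, 200, 'B', 'Z'
)
SELECT
    (SELECT COUNT(*) FROM Demo) AS TotalRows,
    -- Sequence is left out of the DISTINCT list on purpose,
    -- so rows that differ only by Sequence collapse into one.
    (SELECT COUNT(*)
       FROM (SELECT DISTINCT Episode, Value1, Value2 FROM Demo) AS d
    ) AS DistinctRows;
-- On the sample rows: TotalRows = 4, DistinctRows = 3
```

The gap between the two counts is the number of rows that are exact repeats once Sequence is ignored.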
+1  A: 

Use:

  SELECT DISTINCT t.*
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.

50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.

UPDATE: I didn't catch that the sequence column is unique. In that case you have to list all the columns you want to see, for example:

  SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
    FROM TABLE t
ORDER BY t.episode --, and whatever other columns

There's no notation that lets you use t.* while excluding a single column. Once the sequence column is omitted from the output, the duplicates will become apparent.

OMG Ponies
But won't SELECT DISTINCT get confused by the Sequence column, like I was saying?
Margaret
Get confused? Now I'm confused. `DISTINCT *` is just a synonym for `GROUP BY [all the columns in your query]`
OMG Ponies
The point was that each row *is* distinct - the Sequence column I mentioned ensures that. This is, at least partially, the source of the issue - the row might be otherwise identical, but the SELECT DISTINCT won't detect that because the (unique) Sequence value is in there.
Margaret
A: 

I think something like this is what you want:

select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode

This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.

Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straightforward (i.e. describe t and iterate over all the fields).

Also, the self-join in your query should order each pair, e.g. FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence. Otherwise every differing pair is reported twice, once as (a, b) and once as (b, a).
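
To make the effect of the Sequence condition concrete, here is a small sketch of the question's comparison query with that join condition added. It assumes SQL Server and string-typed value columns. The Demo rows are invented sample data, not the asker's table.

```sql
-- Hypothetical sample data; Sequence is unique, Episode repeats.
WITH Demo (Sequence, Episode, Value1, Value2) AS (
    SELECT 1, 100, 'A', 'X' UNION ALL
    SELECT 2, 100, 'A', 'Y' UNION ALL
    SELECT 3, 200, 'B', 'Z' UNION ALL
    SELECT 4, 200, 'B', 'Z'
)
SELECT a.Episode,
       a.Sequence AS SequenceA,
       b.Sequence AS SequenceB,
       CASE WHEN a.Value1 <> b.Value1
            THEN a.Value1 + ',' + b.Value1
            ELSE '' END AS Value1,
       CASE WHEN a.Value2 <> b.Value2
            THEN a.Value2 + ',' + b.Value2
            ELSE '' END AS Value2
FROM Demo a
INNER JOIN Demo b
        ON a.Episode = b.Episode
       AND a.Sequence < b.Sequence   -- keep each pair once, lower Sequence first
WHERE a.Value1 <> b.Value1
   OR a.Value2 <> b.Value2;
-- On the sample rows this returns one row:
-- Episode 100, SequenceA 1, SequenceB 2, Value1 '', Value2 'X,Y'
```

Note that <> never evaluates to true when either side is NULL, so a pair that differs only because one value is NULL would not show up here. Nullable columns would need extra IS NULL checks.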

FryGuy
But the OP *knows* there are duplicate episode values - the question is how to get a list to determine what duplicate to keep or not.
OMG Ponies
A: 

A relatively simple solution that Ponies sparked:

SELECT  t.*
FROM    Table t
    INNER JOIN ( SELECT episode
                 FROM   Table
                 GROUP BY Episode
                 HAVING COUNT(*) > 1
               ) AS x ON t.episode = x.episode

And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:

=AND($C2=$C1,A2<>A1)

Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).

Margaret
+1  A: 

Instead of typing out all 50 columns, you could do this:

select column_name from information_schema.columns where table_name = 'your table name'

then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:

select 
  count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
  col1
, col2
, col3
, ...
having count(episode) > 1

This should give you every combination of column values that appears on more than one row, but without the sequence or episode numbers themselves. Here's the rub: to get those back, you will need to join this result set to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.

Here's where I like to use SQL to generate more SQL. This should get you started:

select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
and column_name not in ('sequence', 'episode') -- t2 does not contain these two

You'll plug in those join parameters to this query:

select * from YourTable t1 
inner join (
select 
      count(episode) 'epcount'
    , col1
    , col2
    , col3
    , ...
    from YourTable
    group by
      col1
    , col2
    , col3
    , ...
    having count(episode) > 1
) t2 on 

...plug in all those join parameters here...
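
The same SQL-generating-SQL trick could also produce the CASE expressions from the question, rather than the join conditions. The following is a rough sketch, not a tested solution for the real table: it assumes SQL Server, uses the placeholder names 'YourTable', 'Sequence' and 'Episode' from this thread, and casts every column to VARCHAR(100), which is an arbitrary choice so that non-string columns can be concatenated.

```sql
-- Emits one line of SQL text per column, ready to paste into the
-- SELECT list of a self-join that uses the aliases a and b.
SELECT ', CASE WHEN a.' + QUOTENAME(COLUMN_NAME)
     + ' <> b.' + QUOTENAME(COLUMN_NAME)
     + ' THEN CAST(a.' + QUOTENAME(COLUMN_NAME) + ' AS VARCHAR(100))'
     + ' + '','' + CAST(b.' + QUOTENAME(COLUMN_NAME) + ' AS VARCHAR(100))'
     + ' ELSE '''' END AS ' + QUOTENAME(COLUMN_NAME)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'YourTable'                    -- placeholder table name
  AND COLUMN_NAME NOT IN ('Sequence', 'Episode')  -- compared separately
ORDER BY ORDINAL_POSITION;
```

For a column named Value1 this generates text of the form `, CASE WHEN a.[Value1] <> b.[Value1] THEN CAST(a.[Value1] AS VARCHAR(100)) + ',' + CAST(b.[Value1] AS VARCHAR(100)) ELSE '' END AS [Value1]`. A similar generator with ' OR ' separators could build the WHERE clause.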
Jeff Meatball Yang
A: 

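Generate and store a hash key for each row, designed so that two rows get the same hash exactly when they match under your definition of sameness (here, every column except Sequence). Depending on the complexity of your rows, keeping the hash up to date might only need a simple trigger that fires when the row is modified.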

Query for duplicates of the hash key, which are your "very probably" identical rows.
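
As a rough illustration of the idea, the sketch below computes the hash on the fly with SQL Server's CHECKSUM function instead of storing it and maintaining it with a trigger, so it is a simplification of what is described above. The Demo rows and column names are invented. CHECKSUM can collide, which is why matches are only "very probably" identical and should be confirmed by comparing the actual columns.

```sql
-- Hypothetical sample data; Sequence is unique, Episode repeats.
WITH Demo (Sequence, Episode, Value1, Value2) AS (
    SELECT 1, 100, 'A', 'X' UNION ALL
    SELECT 2, 100, 'A', 'Y' UNION ALL
    SELECT 3, 200, 'B', 'Z' UNION ALL
    SELECT 4, 200, 'B', 'Z'
)
SELECT CHECKSUM(Episode, Value1, Value2) AS RowHash,  -- Sequence deliberately excluded
       COUNT(*)      AS RowsWithHash,
       MIN(Sequence) AS FirstSequence,
       MAX(Sequence) AS LastSequence
FROM Demo
GROUP BY CHECKSUM(Episode, Value1, Value2)
HAVING COUNT(*) > 1;   -- hashes shared by more than one row
```

On the sample rows, sequences 3 and 4 are identical apart from Sequence, so they should fall into the same group.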

ddyer