views:

51

answers:

2

At my worki we have data stored in a database, the data is not normalized. I am looking for a way to find what data was duplicated.

Our Data base has 3 rows columns, Name, State, Strategy

This data might looks something like this:

OldTable:
Name | State | Strat
-----+-------+------
A    | M     | 1
A    | X     | 3
B    | T     | 6
C    | M     | 1
C    | X     | 3
D    | X     | 3

What I'd like to do is move the data to two tables, one containing the name the other containing the set of State and Strats so it would look more like this

NewTable0:
Name | StratID
-----+--------
A    | 1
B    | 2
C    | 1
D    | 3

NewTable1:
StratID | State | Strat
--------+-------+------
1       | M     | 1    
1       | X     | 3
2       | T     | 6
3       | X     | 3

So in the data example A and C would be duplicates, but D would not be. How would I go about finding and/or identifying these duplicates?

+1  A: 

You could find this out by grouping the Names together, and only listing those where there is more than one record:

SELECT OldTable.Name, COUNT(1) Duplicates
FROM OldTable
GROUP BY OldTable.Name
HAVING Duplicates > 1

Should output:

OldTable:
Name | Duplicates
-----+------------
A    | 2
C    | 2
Gus
except that won't catch cases that have the same number of entries, but whose entries differ.
Apeiron
+1  A: 

Try:

SELECT OT1.Name Name1, OT2.Name Name2
FROM OldTable OT1
JOIN OldTable OT2 ON OT1.Name < OT2.Name AND 
                     OT1.State = OT2.State AND 
                     OT1.Strat = OT2.Strat
GROUP BY OT1.Name, OT2.Name 
HAVING COUNT(*) = (SELECT COUNT(*) FROM OldTable TC1 WHERE TC1.NAME = OT1.NAME) 
   AND COUNT(*) = (SELECT COUNT(*) FROM OldTable TC2 WHERE TC2.NAME = OT2.NAME)
Mark Bannister
Very Nice Solution, I tested it with a few sets of data that I knew had duplicates and lo and behold it worked. One thing I don't understand is OT1.Name < OT2.Name vs OT1.Name != OT2.Name perhaps this is just a difference in DBMS?Now I just need to figure this out, I understand the join but not completely the HAVING section.
Apeiron
@Apeiron, I was concerned with the possibility that their might be much more than two names with the same set of states and strats - `OT1.Name < OT2.Name` will return half as many rows as `OT1.Name != OT2.Name`. (So for example, if 4 different names all share the same set of states and strats, the former will return 6 rows [ab, ac, ad, bc, bd, cd] while the latter will return 12 rows [ab, ac, ad, ba, bc, bd, ca, cb, cd, da, db, dc].)
Mark Bannister
The main part of the query finds the number of combinations that all pairs of names have in common, while each of the HAVING conditions compares this with the total number of combinations for each of the names in the pair.
Mark Bannister