I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

A: 

You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.
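A minimal sketch of that idea using Python's sqlite3, with a made-up two-column table (the table and column names here are purely illustrative): hash every column except the PK, bucket rows by hash, and any bucket with more than one id holds candidate duplicates.

```python
import hashlib
import sqlite3

# Hypothetical demo table; names are illustrative, not from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
conn.executemany("INSERT INTO t (a, b) VALUES (?, ?)",
                 [("x", "1"), ("y", "2"), ("x", "1")])

# Hash every column except the PK; rows with equal hashes are candidate dupes.
buckets = {}
for pk, a, b in conn.execute("SELECT id, a, b FROM t"):
    row_hash = hashlib.sha256(repr((a, b)).encode()).hexdigest()
    buckets.setdefault(row_hash, []).append(pk)

dupes = {h: pks for h, pks in buckets.items() if len(pks) > 1}
print(dupes)  # one bucket containing the ids of the two ("x", "1") rows
```

Note the comments below about collisions: with a strong hash like SHA-256 accidental collisions are astronomically unlikely, but to be strictly safe you can compare the actual column values within each bucket before deleting anything.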

If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:

  • Find the dupes
  • Choose one of them as dupeA to eliminate
  • Find all data dependent on dupeA
  • Alter that data to refer to dupeB
  • delete dupeA.
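The steps above can be sketched like this in Python/sqlite3, with hypothetical parent and child tables (all names invented for illustration): re-point every dependent row at the surviving duplicate, then delete the other one.

```python
import sqlite3

# Hypothetical schema: 'parent' holds the duplicated rows, 'child' depends on them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY,
                     parent_id INTEGER REFERENCES parent(id));
INSERT INTO parent VALUES (1, 'Smith'), (2, 'Smith');  -- rows 1 and 2 are dupes
INSERT INTO child (parent_id) VALUES (1), (2), (1);
""")

dupe_a, dupe_b = 1, 2  # eliminate dupe_a, keep dupe_b

# Re-point all data dependent on dupeA at dupeB, then delete dupeA.
conn.execute("UPDATE child SET parent_id = ? WHERE parent_id = ?", (dupe_b, dupe_a))
conn.execute("DELETE FROM parent WHERE id = ?", (dupe_a,))

print(conn.execute("SELECT COUNT(*) FROM child WHERE parent_id = 2").fetchone()[0])
```

In a real database you would wrap the UPDATE and DELETE in one transaction, and (as a later answer points out) be prepared for the UPDATE to trip unique constraints on the child table.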

This could be easy or complicated, depending on your existing data model.

This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!

FrustratedWithFormsDesigner
How exactly do you ensure that hashes don't produce the same code for different data? When using hashes there is always a possibility of collisions. Downvote.
Esben Skov Pedersen
@EsbenP: I've seen zipped strings used as a sort of hash for the original strings. At least you don't get dupes.
FrustratedWithFormsDesigner
I was assuming that no other data is dependent on the PK field. But that's an interesting point - what would you do if that was so? That would make it a whole lot more complicated... Any ideas?
froadie
why bother hashing when you can use distinct.
Esben Skov Pedersen
A: 

For SQL, you may use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table; (note this is MySQL syntax, and the target table needs a unique key covering the relevant columns so that duplicates are silently skipped).

For an algorithm: if you can assume that to-be-primary keys may be repeated, but a to-be-primary key uniquely identifies the content of its row, then hash only the to-be-primary key and check for repetition.

ron
A: 

I think this should require nothing more than grouping by all columns except the id and choosing one row from every group - for simplicity just the first row, but which one doesn't actually matter unless you have additional constraints on the id.

Or the other way around, to get rid of the rows ... just delete all rows except a single one from each group.
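The delete-all-but-one variant can be written as a single statement, keeping (say) the lowest id per group. A sketch in Python/sqlite3 with an invented table:

```python
import sqlite3

# Keep one row per group of identical non-PK columns (the MIN(id)), delete the rest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
conn.executemany("INSERT INTO t (a, b) VALUES (?, ?)",
                 [("x", "1"), ("x", "1"), ("y", "2"), ("y", "2"), ("y", "2")])

conn.execute("""
    DELETE FROM t
    WHERE id NOT IN (SELECT MIN(id) FROM t GROUP BY a, b)
""")
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

Some databases restrict subqueries on the table being deleted from (older MySQL versions, for example), in which case the MIN(id) set goes into a temp table first.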

Daniel Brückner
+4  A: 

Removing Duplicate Records:

http://www.databasejournal.com/features/mssql/article.php/1438651/Removing-Duplicate-Records.htm

How to remove duplicate rows from a table in SQL Server:

http://support.microsoft.com/kb/139444

Remove duplicate entries / rows a mySQL database table:

http://www.justin-cook.com/wp/2006/12/12/remove-duplicate-entries-rows-a-mysql-database-table/

Removing Duplicates from a Table in SQL Server:

http://www.simple-talk.com/sql/t-sql-programming/removing-duplicates-from-a-table-in-sql-server/

That should get you going, if you need more examples please let me know.

James Campbell
I'm bookmarking these!
FrustratedWithFormsDesigner
A: 

SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.
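A sketch of that round-trip in Python/sqlite3 (table and column names invented; here the old table's rows are deleted and reloaded rather than the table dropped, which amounts to the same thing):

```python
import sqlite3

# Copy the distinct non-PK columns into a temp table, empty the original, reload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
conn.executemany("INSERT INTO foo (a, b) VALUES (?, ?)",
                 [("x", "1"), ("x", "1"), ("y", "2")])

conn.executescript("""
    CREATE TEMP TABLE dedup AS SELECT DISTINCT a, b FROM foo;
    DELETE FROM foo;
    INSERT INTO foo (a, b) SELECT a, b FROM dedup;
""")
print(conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0])  # 2
```

The surviving rows get freshly assigned PKs, so this only works when, as the question states, nothing else depends on the old PK values.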

Hank Gay
+2  A: 

Deduping is rarely simple. That's because the records to be deduped often have slightly different values in some of the fields, so choosing which record to keep can be problematic. Further, dups are often people records, and it is hard to tell whether two John Smiths are two people or one person duplicated. So spend a lot (50% or more of the whole project) of your time defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Deduping also requires that you handle all child records without orphaning any. What happens when you find that by changing the id on a child record you suddenly violate one of the unique indexes or constraints? This will happen eventually, and your process needs to handle it. If you have foolishly chosen to enforce all your constraints only through the application, you may not even know the constraints are violated. When you have 10,000 records to dedup, you aren't going to go through the application one record at a time. If the constraint isn't in the database, lots of luck maintaining data integrity when you dedup.

A further complication is that dups don't always match exactly on name or address. For instance, a sales rep named Joan Martin may be a dup of a sales rep named Joan Martin-Jones, especially if they have the same address and email. Or you could have John versus Johnny in the name, or the same street address except that one record abbreviated "St." and the other spelled out "Street". In SQL Server you can use SSIS and Fuzzy Grouping to identify near matches as well. These are often the most common dups, since the fact that they weren't exact matches is how they got entered as dups in the first place.

For some types of deduping you may need a user interface, so that the person doing the deduping can choose which of two values to use for a particular field. This is especially true if the person being deduped appears in two or more roles. It could be that the data for one role is usually better than the data for another. Or it could be that only the users know for sure which is the correct value, or they may need to contact people to find out whether records are genuinely dups or simply two people with the same name.

HLGEM