views:

89

answers:

6

I need to make a column unique in one of our database tables, and we want to completely remove any duplicates from the table. There is however a snag, in that there are a bunch of dependencies to other tables that will be affected.

For example, let’s say we’ve got the following relationship:

-------------------   -------------------
*     Customer    *   *      Order      *
-------------------   -------------------
* ID              *   * ID              *
* Name            *   * CustomerID      *
* Address         *   * Item            *
-------------------   -------------------

Fairly obvious relationship there – an Order needs a CustomerID, and that’s a foreign key. So we can’t delete the customer and keep the Order data.

In the example I’m totally ok with losing the redundant customer data, but for later reference I’d like to change the CustomerID from Order to a Customer ID for a “Deleted Customer”-row.

Is there any way of saying “delete this, and if there are foreign key constraints, change the CustomerID to that ID instead”? The DB is MS SQL 2005.

+1  A: 

AFAIK, it can't be done in one SQL statement.

But it sounds like fair game for a BEFORE DELETE trigger on CUSTOMER.

You'd just have to make sure that the two operations were a single unit of work.

duffymo
+1  A: 

You could use a delete trigger on the customer table. Books online on create trigger.

But why don't you just update the order records before delete customers? It's easier, avoids the pains triggers and IMO would keep the logic in the same place.

Dan Sydner
+1  A: 

Anyway, you have to have two ids: dupCustomerId and newCustomerId, so why don’t you just update the reference first?

UPDATE Order set CustomerID = newCustomerId WHERE CustomerID = dupCustomerId

And then delete the duplicate from the clients table:

DELETE from Customer WHERE ID = dupCustomerId

Or am I missing something here?

Maciej Łebkowski
+4  A: 

I'd say:

step 1: create a list of the duplicate IDs to be deleted along with the corresponding ID to keep. The method really depends on how you're detecting duplicates. Say you get a table:

-------------
*  Dupes    *
-------------
* del_id    *
* keep_id   *
-------------

step 2: relink orders

update order o
set CustomerID=(select keep_id from Dupes d where d.del_id=o.CustomerID)
where CustomerID in (select del_id from Dupes)

step 3: delete the old customers

delete from Customer
where ID in (select del_id from Dupes)

And voilà.

Eric Darchis
+1  A: 

Are you saying that you have duplicate Customer records i.e. the customer details are the same however the Customer ID differs and so there are Orders that reference multiple versions of the same customer?

If so, I would perform a data cleaning exercise.

Create/build a lookup table with the columns

  • CustomerID
  • OrderID
  • PrimaryCustomerID - (needs to be calculated)

You can then perform an update to the Orders table to ensure that each Order references a PrimaryCustomerID only.

You can then delete the Customer records that are no longer referenced by an Order (i.e. they are duplicates). Alternatively you could add an attribute to the Customer table in order to mark the records rather than delete (i.e. duplicateFlag or isDeleted).

Hope this makes sense.

John Sansom
A: 

We have built a dedupping tool (and so should you) that first looks for places where there are data conflicts (two different business phone numbers for instance) and allows the person doing the dedupping to choose the correct data. The tool then changes the id to the one you are keeping, starting at the bottommost child table and working up through all related tables. Once all the references to the record it is deleting are rmoved, it deltes the parent record. Dedupping is usually a complex process and this tool should be carefully designed to handle what needs to be handled and to allow change to the tool as new foreign key tables are added. You can set it up to alawys chosse the information inthe record you are keeping when you have data conflicts, but it is usually a poor idea to do so without manual intervention. This is because you often need the input of someone who knows the customers. Otherwise you may end up replacing a good address with a bad address. Here is a common scenario for how the dup got there in the first place. Customer A has been a customer for awhile and has several orders. He goes to order agin and the order taker asks for his phone number or some other identifying piece of information to help look him up. Customer A has recently moved and has a new phone number and address, so he isn't found and a new record is created. Later it is realized it is a dup, but the automated dedup process picks the older record because it has more orders and thus replaces the record withthe current new address and phone. Customer calls in to order again and another dup is created becasue again the order taker can't find him. This is why I strongly feel dedupping must be a partly manual process.

HLGEM