views:

71

answers:

2

I have an interesting problem, and my logic isn't up to the task.

We have a table with that sometimes develops duplicate records (for process reasons, and this is unavoidable). Take the following example:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-1234  [email protected]
 2  Jane       Smith     123-555-1111  [email protected]
 3  John       Doe       123-555-4321  [email protected]
 4  Bob        Jones     123-555-5555  [email protected]
 5  John       Doe       123-555-0000  [email protected]
 6  Mike       Roberts   123-555-9999  [email protected]
 7  John       Doe       123-555-1717  [email protected]

We find the duplicates this way:

SELECT c1.* 
FROM `clients` c1
INNER JOIN (
    SELECT `FirstName`, `LastName`, COUNT(*)
    FROM `clients`
    GROUP BY `FirstName`, `LastName`
    HAVING COUNT(*) > 1
) AS c2
ON c1.`FirstName` = c2.`FirstName`
AND c1.`LastName` = c2.`LastName`

This generates the following list of duplicates:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-1234  [email protected]
 3  John       Doe       123-555-4321  [email protected]
 5  John       Doe       123-555-0000  [email protected]
 7  John       Doe       123-555-1717  [email protected]

As you can see, based on FirstName and LastName, all of the records are duplicates.

At this point, we actually make a phone call to the client to clear up potential duplicates.

After doing so, we learn (for example) that records 1 and 3 are real duplicates, but records 5 and 7 are actually two different people altogether.

So we merge any extraneously linked data from records 1 and 3 into record 1, remove record 3, and leave records 5 and 7 alone.

Now here's were the problem comes in:

The next time we re-run the "duplicates" query, it will contain the following rows:

id  FirstName  LastName  PhoneNumber   email
--  ---------  --------  ------------  --------------
 1  John       Doe       123-555-4321  [email protected]
 5  John       Doe       123-555-0000  [email protected]
 7  John       Doe       123-555-1717  [email protected]

They all appear to be duplicates, even though we've previously recognized that they aren't.

How would you go about identifying that these records aren't duplicates?

My first though it to build a lookup table identifying which records aren't duplicates of each other (for example, {1,5},{1,7},{5,7}), but I have no idea how to build a query that would be able to use this data.

Further, if another duplicate record shows up, it may be a duplicate of 1, 5, or 7, so we would need them all to show back up in the duplicates list so the customer service person can call the person in the new record to find out which record he may be a duplicate of.

I'm stretched to the limit trying to understand this. Any brilliant geniuses out there that would care to take a crack at this?

+1  A: 

Interesting problem. Here's my crack at it.

How about if we approach the problem from a slightly different perspective.

Consider that the system is clean for a start i.e all records currently in the system are either with Unique First + Last name combinations OR the same first + last name ones have already been manually confirmed to be different people.

At the point of entering a NEW user in the system, we have an additional check. Can be implemented as an INSERT Trigger or just another procedure called after the insert is successfully done.

  1. This Trigger / Procedure matches the FIRST + LAST name combination of "Inserted"record with all existing records in the table.
  2. For all the matching First + Last names, it will create an entry in a matching table (new table) with NewUserID, ExistingMatchingRecordsUserID

From an SQL perspective,

TABLE MatchingTable
COLUMNS 1. NewUserID 2. ExistingUserID
Constraint : Logical PK = NewUserID + ExistingMatchingRecordsUserID

INSERT INTO MATCHINGTABLE VALUES ('NewUserId', userId)
SELECT userId FROM User  u where u.firstName = 'John' and u.LastName = 'Doe'

All entries in MatchingTable need resolution.

When say an Admin logs into the system, the admin sees the list of all entries in MatchingTable

eg: New User John Doe - (ID 345) - 3 Potential matches John Doe - ID 123 ID 231 / ID 256

The admin will check up data for 345 against data in 123 / 231 and 256 and manually confirm if duplicate of ANY / None If Duplicate, 345 is deleted from User Table (soft / hard delete - whatever suits you) If NOT, the entries for ID 354 are just removed from MatchingTable (i would go with hard deletes here as this is like a transactional temp table but again anything is fine).

Additionally, when entries for ID 354 are removed from MatchingTable, all other entries in MatchingTable where ExistingMatchingRecordsUserID = 354 are automatically removed to ensure that unnecessary manual verification for already verified data is not needed.

Again, this could be a potential DELETE trigger / Just logic executed additionally on DELETE of MatchingTable. The implementation is subject to preference.

InSane
This solution worked out perfectly for my needs. I honestly don't think I would have found my way to this solution if you hadn't outlined it so well. Thank you!
pbarney
A: 

At the expense of adding a single byte per row to your table, you could add a manually_verified BOOL column, with a default of FALSE. Set it to TRUE if you have manually verified the data. Then you can simply query where manually_verified = FALSE.

It's simple, effective, and matches what is actually happening in the business processes: you manually verify the data.

If you want to go a step further, you might want to store when the row was verified and who verified it. Since this might be annoying to store in the main table, you could certainly store it in a separate table, and LEFT JOIN in the verification data. You could even create a view to recreate the appearance of a single master table.

To solve the problem of a new duplicate being added: you would check non-verified data against the entire data set. So that means your main table, c1, would have the condition manually_verified = FALSE, but your INNER JOINed table, c2, does not. This way, the unverified data will still find all potential duplicate matches:

SELECT * FROM table t1
INNER JOIN table t2 ON t1.name = t2.name AND t1.id <> t2.id
WHERE t1.manually_verified = FALSE

The possible matches for the duplicates will be in the joined table.

wuputah
I like your approach, but using the modified query you suggest shows only the new, unverified record, and not any records that it appears to be a duplicate of. Otherwise, I think this method would have worked.
pbarney
Well, you have to run a slightly different query...
wuputah
Added the query to the end of the answer.
wuputah