ansaurus

Question

MySQL duplicates -- how to specify when two records actually AREN'T duplicates?

Answer 1

+1 A:

Interesting problem. Here's my crack at it.

How about approaching the problem from a slightly different perspective?

Assume the system starts out clean, i.e. every record currently in the system either has a unique first + last name combination, or shares its first + last name with records that have already been manually confirmed to be different people.

At the point of entering a NEW user in the system, we add an additional check. It can be implemented as an INSERT trigger or as a separate procedure called after the insert has succeeded.

This trigger / procedure matches the first + last name combination of the inserted record against all existing records in the table.
For every matching first + last name, it creates an entry in a matching table (a new table) with the columns NewUserID and ExistingUserID.

From an SQL perspective,

CREATE TABLE MatchingTable (
    NewUserID      INT NOT NULL,
    ExistingUserID INT NOT NULL,
    PRIMARY KEY (NewUserID, ExistingUserID)
);

-- @newUserId holds the ID of the record that was just inserted
INSERT INTO MatchingTable (NewUserID, ExistingUserID)
SELECT @newUserId, u.userId FROM User u
WHERE u.firstName = 'John' AND u.lastName = 'Doe' AND u.userId <> @newUserId;
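As a rough illustration of the trigger variant, here is a minimal sketch. It uses Python's built-in sqlite3 module purely so that it runs without a server; SQLite stands in for MySQL, whose trigger syntax differs slightly (it needs FOR EACH ROW and usually a DELIMITER change). The trigger name trg_user_match and the column types are assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (
    userId    INTEGER PRIMARY KEY,
    firstName TEXT NOT NULL,
    lastName  TEXT NOT NULL
);
CREATE TABLE MatchingTable (
    NewUserID      INTEGER NOT NULL,
    ExistingUserID INTEGER NOT NULL,
    PRIMARY KEY (NewUserID, ExistingUserID)
);
""")

# Seed the "clean" starting state before the trigger exists, so these
# already-confirmed records do not flag each other.
conn.executemany(
    "INSERT INTO User (userId, firstName, lastName) VALUES (?, ?, ?)",
    [(123, "John", "Doe"), (231, "John", "Doe"), (256, "John", "Doe")],
)

# Hypothetical trigger: for each new row, record every existing user
# with the same first + last name as a potential match.
conn.executescript("""
CREATE TRIGGER trg_user_match AFTER INSERT ON User
BEGIN
    INSERT INTO MatchingTable (NewUserID, ExistingUserID)
    SELECT NEW.userId, u.userId
    FROM User AS u
    WHERE u.firstName = NEW.firstName
      AND u.lastName  = NEW.lastName
      AND u.userId   <> NEW.userId;
END;
""")

conn.execute("INSERT INTO User VALUES (345, 'John', 'Doe')")  # name clash
conn.execute("INSERT INTO User VALUES (400, 'Jane', 'Roe')")  # no clash

matches = conn.execute(
    "SELECT NewUserID, ExistingUserID FROM MatchingTable "
    "ORDER BY NewUserID, ExistingUserID"
).fetchall()
print(matches)
```

Only the new John Doe (345) produces rows in MatchingTable; the Jane Roe insert leaves it untouched.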

All entries in MatchingTable need resolution.

When an admin logs into the system, the admin sees the list of all entries in MatchingTable.

e.g. New user John Doe (ID 345): 3 potential matches named John Doe (ID 123 / ID 231 / ID 256)
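The admin list could be produced along these lines. This is only a sketch under assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the sample rows are made up to mirror the example above, and the report wording is illustrative.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (userId INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
CREATE TABLE MatchingTable (
    NewUserID      INTEGER NOT NULL,
    ExistingUserID INTEGER NOT NULL,
    PRIMARY KEY (NewUserID, ExistingUserID)
);
INSERT INTO User VALUES (123, 'John', 'Doe'), (231, 'John', 'Doe'),
                        (256, 'John', 'Doe'), (345, 'John', 'Doe');
INSERT INTO MatchingTable VALUES (345, 123), (345, 231), (345, 256);
""")

# One row per unresolved (new user, existing user) pair, with the new
# user's name joined in for display.
rows = conn.execute("""
    SELECT m.NewUserID, n.firstName, n.lastName, m.ExistingUserID
    FROM MatchingTable AS m
    JOIN User AS n ON n.userId = m.NewUserID
    ORDER BY m.NewUserID, m.ExistingUserID
""").fetchall()

# Group the pairs by new user so each pending record gets one line.
pending = defaultdict(list)
for new_id, first, last, existing_id in rows:
    pending[(new_id, first, last)].append(existing_id)

report = [
    f"New user {first} {last} (ID {new_id}): {len(ids)} potential matches: "
    + " / ".join(f"ID {i}" for i in ids)
    for (new_id, first, last), ids in pending.items()
]
for line in report:
    print(line)
```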

The admin checks the data for 345 against the data for 123, 231 and 256 and manually confirms whether it duplicates any of them or none. If it is a duplicate, 345 is deleted from the User table (soft or hard delete, whatever suits you). If it is not, the entries for ID 345 are simply removed from MatchingTable (I would go with hard deletes here, since this is effectively a transactional temp table, but again either is fine).

Additionally, when the entries for ID 345 are removed from MatchingTable, all other entries in MatchingTable where ExistingUserID = 345 are automatically removed as well, so that data which has already been verified does not go through unnecessary manual verification again.

Again, this could be a DELETE trigger on MatchingTable or just extra logic executed whenever its entries are deleted. The implementation is a matter of preference.
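Here is one way the resolution step might look as plain application logic rather than a DELETE trigger. It is a sketch only: SQLite (via Python's sqlite3) stands in for MySQL, the function names are hypothetical, and user 400 is an invented second newcomer added to show the cleanup of rows that point at 345.

```python
import sqlite3

def clear_matches(conn, user_id):
    # Drop every pending pair that mentions this user in either column.
    conn.execute(
        "DELETE FROM MatchingTable WHERE NewUserID = ? OR ExistingUserID = ?",
        (user_id, user_id),
    )

def resolve_not_duplicate(conn, new_user_id):
    # Admin confirmed a distinct person: keep the user, clear the pairs.
    with conn:
        clear_matches(conn, new_user_id)

def resolve_duplicate(conn, new_user_id):
    # Admin confirmed a duplicate: hard-delete the user, clear the pairs.
    with conn:
        conn.execute("DELETE FROM User WHERE userId = ?", (new_user_id,))
        clear_matches(conn, new_user_id)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (userId INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
CREATE TABLE MatchingTable (
    NewUserID      INTEGER NOT NULL,
    ExistingUserID INTEGER NOT NULL,
    PRIMARY KEY (NewUserID, ExistingUserID)
);
INSERT INTO User VALUES (123, 'John', 'Doe'), (231, 'John', 'Doe'),
                        (256, 'John', 'Doe'), (345, 'John', 'Doe'),
                        (400, 'John', 'Doe');
INSERT INTO MatchingTable VALUES (345, 123), (345, 231), (345, 256),
                                 (400, 123), (400, 231), (400, 256),
                                 (400, 345);
""")

query = ("SELECT NewUserID, ExistingUserID FROM MatchingTable "
         "ORDER BY NewUserID, ExistingUserID")

resolve_not_duplicate(conn, 345)   # 345 is a different John Doe
after_345 = conn.execute(query).fetchall()

resolve_duplicate(conn, 400)       # 400 turned out to be a duplicate
after_400 = conn.execute(query).fetchall()
users = [r[0] for r in conn.execute("SELECT userId FROM User ORDER BY userId")]
print(after_345, after_400, users)
```

Wrapping each resolution in a transaction (the `with conn:` block) keeps the user delete and the MatchingTable cleanup together, which is roughly what a trigger would give you.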

InSane 2010-09-18 06:13:40

This solution worked out perfectly for my needs. I honestly don't think I would have found my way to this solution if you hadn't outlined it so well. Thank you!

pbarney 2010-09-30 01:44:16

Answer 2

A:

At the expense of adding a single byte per row to your table, you could add a manually_verified BOOL column, with a default of FALSE. Set it to TRUE if you have manually verified the data. Then you can simply query where manually_verified = FALSE.

It's simple, effective, and matches what is actually happening in the business processes: you manually verify the data.

If you want to go a step further, you might want to store when the row was verified and who verified it. Since this might be annoying to store in the main table, you could certainly store it in a separate table, and LEFT JOIN in the verification data. You could even create a view to recreate the appearance of a single master table.
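The separate-table idea could be sketched roughly as follows. SQLite (via Python's sqlite3) stands in for MySQL so the example runs as-is, and the names contacts, contact_verifications and contacts_with_verification, along with the sample data, are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One optional row per contact: who verified it and when.
CREATE TABLE contact_verifications (
    contact_id  INTEGER PRIMARY KEY REFERENCES contacts(id),
    verified_by TEXT NOT NULL,
    verified_at TEXT NOT NULL
);
-- The view recreates the look of a single master table; a contact
-- counts as verified when a verification row exists for it.
CREATE VIEW contacts_with_verification AS
SELECT c.id,
       c.name,
       v.contact_id IS NOT NULL AS manually_verified,
       v.verified_by,
       v.verified_at
FROM contacts AS c
LEFT JOIN contact_verifications AS v ON v.contact_id = c.id;

INSERT INTO contacts VALUES (1, 'John Doe'), (2, 'John Doe');
INSERT INTO contact_verifications VALUES (1, 'admin', '2010-01-01 00:00:00');
""")

view_rows = conn.execute(
    "SELECT id, manually_verified, verified_by "
    "FROM contacts_with_verification ORDER BY id"
).fetchall()
print(view_rows)
```

With this layout the manually_verified flag is derived from the presence of a verification row instead of being stored in the main table.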

To solve the problem of a new duplicate being added: you would check non-verified data against the entire data set. So that means your main table, t1, would have the condition manually_verified = FALSE, but your INNER JOINed table, t2, does not. This way, the unverified data will still find all potential duplicate matches:

SELECT * FROM your_table t1
INNER JOIN your_table t2 ON t1.name = t2.name AND t1.id <> t2.id
WHERE t1.manually_verified = FALSE

The possible matches for the duplicates will be in the joined table.
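To see what the join returns, here is a small runnable sketch on invented sample data. SQLite (via Python's sqlite3) stands in for MySQL, and the table name people is a placeholder. Rows 1 and 2 share a name but are already verified, row 3 is a new unverified John Doe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    manually_verified INTEGER NOT NULL DEFAULT 0
);
INSERT INTO people (id, name, manually_verified) VALUES
    (1, 'John Doe', 1),
    (2, 'John Doe', 1),
    (3, 'John Doe', 0),
    (4, 'Jane Roe', 1);
""")

# t1 is restricted to unverified rows; t2 is not, so each unverified
# row is paired with every other row that shares its name.
pairs = conn.execute("""
    SELECT t1.id AS unverified_id,
           t2.id AS candidate_id,
           t2.manually_verified AS candidate_verified
    FROM people AS t1
    INNER JOIN people AS t2 ON t1.name = t2.name AND t1.id <> t2.id
    WHERE t1.manually_verified = 0
    ORDER BY t1.id, t2.id
""").fetchall()
print(pairs)
```

The verified rows 1 and 2 are never reported against each other; they only show up as candidates for the unverified row 3. Selecting columns from both t1 and t2 is what makes the existing records visible next to the new one.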

wuputah 2010-09-18 06:41:04

I like your approach, but using the modified query you suggest shows only the new, unverified record, and not any records that it appears to be a duplicate of. Otherwise, I think this method would have worked.

pbarney 2010-09-19 03:52:44

Well, you have to run a slightly different query...

wuputah 2010-09-19 05:29:48

Added the query to the end of the answer.

wuputah 2010-09-20 00:19:04
