views:

145

answers:

5

I have a requirement to produce a list of possible duplicates before a user saves an entity to the database and warn them of the possible duplicates.

There are 7 criteria on which we should check the for duplicates and if at least 3 match we should flag this up to the user. The criteria will all match on ID, so there is no fuzzy string matching needed but my problem comes from the fact that there are many possible ways (99 ways if I've done my sums corerctly) for at least 3 items to match from the list of 7 possibles.

I don't want to have to do 99 separate db queries to find my search results and nor do I want to bring the whole lot back from the db and filter on the client side. We're probably only talking of a few tens of thousands of records at present, but this will grow into the millions as the system matures.

Anyone got any thoughs of a nice efficient way to do this? I was considering a simple OR query to get the records where at least one field matches from the db and then doing some processing on the client to filter it some more, but a few of the fields have very low cardinality and won't actually reduce the numbers by a huge amount.

Thanks Jon

+3  A: 

OR and CASE summing will work but are quite inefficient, since they don't use indexes.

You need to make UNION for indexes to be usable.

If a user enters name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:

SELECT  i.*
FROM    (
        SELECT  id, COUNT(*)
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = 'Eve Chianese'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = '+15558000042'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = '[email protected]'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = '42 North Lane'
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        ) dq
JOIN    t_info i
ON      i.id = dq.id

This will use indexes on these fields and the query will be fast.

See this article in my blog for details:

  • Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions

Also see this question the article is based upon.

If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:

SELECT  i.*
FROM    t_info i1
WHERE   EXISTS
        (
        SELECT  1
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = i1.name
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = i1.phone
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = i1.email
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = i1.address
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        )

Note that this DISTINCT is not transitive: if A matches B and B matches C, this does not mean that A matches C.

Quassnoi
Thanks, think this one looks like the best solution for my problem.We're still going to have to do a group by on a set with many elements, but doing a bit of testing this seems to be quicker than other ways I've tried.
JonC
A: 

What DBS are you using? Some support using such constraints by using server side code.

NineBerry
A: 

Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.

However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match are make sure these are indexed.

Dave Webb
A: 

I'm assuming your system is trying to match tag ids of a certain post, or something similar. This is a multi-to-multi relationship and you should have three tables to handle it. One for the post, one for tags and one for post and tags relationship.

If my assumptions are correct then the best way to handle this is:

SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) > 3;
Nadia Alramli
+2  A: 

You might want something like the following:

SELECT id
FROM 
    (select id, CASE fld1 WHEN input1 THEN 1 ELSE 0 "rule1",
        CASE fld2 when input2 THEN 1 ELSE 0 "rule2",
        ...,
        CASE fld7 when input7 THEN 1 ELSE 0 "rule2",
    FROM table)
WHERE rule1+rule2+rule3+...+rule4 >= 3

This isn't tested, but it shows a way to tackle this.

S.Lott