ansaurus

Question

Efficient checking of possible duplicate entities

Answer 1

+3 A:

OR and CASE summing will work but are quite inefficient, since they don't use indexes.

You need to make UNION for indexes to be usable.

If a user enters name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:

SELECT  i.*
FROM    (
        SELECT  id, COUNT(*)
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = 'Eve Chianese'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = '+15558000042'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = '[email protected]'
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = '42 North Lane'
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        ) dq
JOIN    t_info i
ON      i.id = dq.id

This will use indexes on these fields and the query will be fast.

See this article in my blog for details:

Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions

Also see this question the article is based upon.

If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:

SELECT  i.*
FROM    t_info i1
WHERE   EXISTS
        (
        SELECT  1
        FROM    (
                SELECT  id
                FROM    t_info t
                WHERE   name  = i1.name
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   phone = i1.phone
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   email = i1.email
                UNION ALL
                SELECT  id
                FROM    t_info t
                WHERE   address = i1.address
                ) q
        GROUP BY
                id
        HAVING  COUNT(*) >= 3
        )

Note that this DISTINCT is not transitive: if A matches B and B matches C, this does not mean that A matches C.

Quassnoi 2009-05-05 16:14:55

Thanks, think this one looks like the best solution for my problem.We're still going to have to do a group by on a set with many elements, but doing a bit of testing this seems to be quicker than other ways I've tried.

JonC 2009-05-06 08:28:15

Answer 2

A:

What DBS are you using? Some support using such constraints by using server side code.

NineBerry 2009-05-05 16:15:49

Answer 3

A:

Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.

However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match are make sure these are indexed.

Dave Webb 2009-05-05 16:17:59

Answer 4

A:

I'm assuming your system is trying to match tag ids of a certain post, or something similar. This is a multi-to-multi relationship and you should have three tables to handle it. One for the post, one for tags and one for post and tags relationship.

If my assumptions are correct then the best way to handle this is:

SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) > 3;

Nadia Alramli 2009-05-05 16:19:32

Answer 5

+2 A:

You might want something like the following:

SELECT id
FROM 
    (select id, CASE fld1 WHEN input1 THEN 1 ELSE 0 "rule1",
        CASE fld2 when input2 THEN 1 ELSE 0 "rule2",
        ...,
        CASE fld7 when input7 THEN 1 ELSE 0 "rule2",
    FROM table)
WHERE rule1+rule2+rule3+...+rule4 >= 3

This isn't tested, but it shows a way to tackle this.

S.Lott 2009-05-05 16:23:05

ansaurus

tags:

views:

answers:

Efficient checking of possible duplicate entities

related questions