tags:

views:

382

answers:

7

This query for creating a list of Candidate duplicates is easy enough:

SELECT Count(*), Can_FName, Can_HPhone, Can_EMail
FROM Can 
GROUP BY Can_FName, Can_HPhone, Can_EMail
HAVING Count(*) > 1

But if the actual rule I want to check against is FName and (HPhone OR Email) - how can I adjust the GROUP BY to work with this?

I'm fairly certain I'm going to end up with a UNION SELECT here (i.e. do FName, HPhone on one and FName, EMail on the other and combine the results) - but I'd love to know if anyone knows an easier way to do it.

Thank you in advance for any help.

Scott in Maine

A: 

GROUP BY doesn't support OR - it's implicitly AND and must include every non-aggregator in the select list.

Arnshea
You might be able to hack something together with a user defined function, e.g., UserOR(hphone, email) then include that in the GROUP BY
Arnshea
btw, it'd have to be in the select list and in the group by clause. And the function would have to be deterministic but OR is deterministic so....
Arnshea
@Arnshea: an OR can give but three results, so he will get at most three groups.
Quassnoi
@Quassnoi, good point, he'd have to handle nulls, so use SQL Servers equivalent of NVL - e.g., UserOR(nvl(hphone, false), nvl(email, false)). Or exclude those in the where clause...
Arnshea
I do have a UserDefined function that checks for duplicates - one at a time though. I fear I'm going to end up setting up a Cursor using the UNION method mentioned below, then flip through this cursor one at a time and push out the data to the UserDefined function (it will pull back a definitive list (Can_PK) of duplicate entries). And only THEN can I actually go about the deduping of the Rows. Fun stuff.
Scott04073
+3  A: 

Before I can advise anything, I need to know the answer to this question:

name  phone      email

John  555-00-00  [email protected]
John  555-00-01  [email protected]
John  555-00-01  [email protected]

What COUNT(*) you want for this data?

Update:

If you just want to know that a record has any duplicates, use this:

WITH    q AS (
        SELECT  1 AS id, 'John' AS name, '555-00-00' AS phone, '[email protected]' AS email
        UNION ALL
        SELECT  2 AS id, 'John', '555-00-01', '[email protected]'
        UNION ALL
        SELECT  3 AS id, 'John', '555-00-01', '[email protected]'
        UNION ALL
        SELECT  4 AS id, 'James', '555-00-00', '[email protected]'
        UNION ALL
        SELECT  5 AS id, 'James', '555-00-01', '[email protected]'
        )
SELECT  *
FROM    q qo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    q qi
        WHERE   qi.id <> qo.id
                AND qi.name = qo.name
                AND (qi.phone = qo.phone OR qi.email = qo.email)
        )

It's more efficient, but doesn't tell you where the duplicate chain started.

This query select all entries along with the special field, chainid, that indicates where the duplicate chain started.

WITH    q AS (
        SELECT  1 AS id, 'John' AS name, '555-00-00' AS phone, '[email protected]' AS email
        UNION ALL
        SELECT  2 AS id, 'John', '555-00-01', '[email protected]'
        UNION ALL
        SELECT  3 AS id, 'John', '555-00-01', '[email protected]'
        UNION ALL
        SELECT  4 AS id, 'James', '555-00-00', '[email protected]'
        UNION ALL
        SELECT  5 AS id, 'James', '555-00-01', '[email protected]'
        ),
        dup AS (
        SELECT  id AS chainid, id, name, phone, email, 1 as d
        FROM    q
        UNION ALL
        SELECT  chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
        FROM    dup
        JOIN    q qo
        ON      qo.name = dup.name
                AND (qo.phone = dup.phone OR qo.email = dup.email)
                AND qo.id > dup.id
        ),
        chains AS 
        (
        SELECT  *
        FROM    dup do
        WHERE   chainid NOT IN
                (
                SELECT  id
                FROM    dup di
                WHERE   di.chainid < do.chainid
                )
        )
SELECT  *
FROM    chains
ORDER BY
        chainid
Quassnoi
According to the definition - that's a Count(*) of 3. Hence the complexity. Thank you.
Scott04073
Actually - let me look at that again. Be right back.
Scott04073
OK - definitely 3. Had to double-check on that.
Scott04073
@Scott040703: and if we remove the second rows, you want two COUNT(*)'s of 1, right?
Quassnoi
Exactly ... thanks for looking into this.
Scott04073
I've been following your changes - I was with you on the UNION. It's basically where I think I'm going to end up, create a Cursor from the Union - used my DefinedFunction to pull out the list of duplicates (via Can_PK) for each row of the Cursor, and then act on each of the return _PK values (assigning the lesser _PK values to the Max(_PK) value). But you lost me on the filtering you're doing (q AS ... ).
Scott04073
Scott: you are having duplicate chains here, i. e. one duplicate pulls another one an so on, so you can get a completely different phone-email pair that the original one and they will still be duplicates. This is done recursively here, in this case these rows get selected: 1-2-3, 2-3, 3, 4, 5. 2-3 and 3 are subsets of 1-2-3, so they get filtered out.
Quassnoi
A: 

I assume you also have a unique ID integer as the primary key on this table. If you don't, it's a good idea to have one, for this purpose and many others.

Find those duplicates by a self-join:

select
  c1.ID 
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
, c2.ID 
, c2.Can_FName
, c2.Can_HPhone
, c2.Can_Email
from
(
  select 
      min(ID), 
      Can_FName, 
      Can_HPhone, 
      Can_Email 
  from Can 
  group by 
      Can_FName, 
      Can_HPhone, 
      Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID 
where
    c1.Can_FName = c2.Can_FName 
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
order by
  c1.ID

The query gives you N-1 rows for each N duplicate combinations - if you want just a count along with each unique combination, count the rows grouped by the "left" side:

select count(1) + 1,
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
from 
(
  select 
      min(ID), 
      Can_FName, 
      Can_HPhone, 
      Can_Email 
  from Can 
  group by 
      Can_FName, 
      Can_HPhone, 
      Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID 
where
    c1.Can_FName = c2.Can_FName 
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
group by 
  c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email

Granted, this is more involved than a union - but I think it illustrates a good way of thinking about duplicates.

Jeff Meatball Yang
A: 

Project the desired transformation first from a derived table, then do the aggregation:

SELECT COUNT(*) 
    , CAN_FName
    , Can_HPhoneOrEMail
    FROM (
     SELECT Can_FName 
      , ISNULL(Can_HPhone,'') +  ISNULL(Can_EMail,'')  AS Can_HPhoneOrEMail
     FROM Can) AS Can_Transformed
    GROUP BY Can_FName, Can_HPhoneOrEMail
    HAVING Count(*) > 1

Adjust your 'OR' operation as needed in the derived table project list.

Remus Rusanu
This seems to be more of an AND situation you're describing rather than an "OR" - Thank you for the effort though.
Scott04073
A: 

I know this answer will be criticised for the use of the temp table, but it will work anyway:

-- create temp table to give the table a unique key
create table #tmp(
ID int identity,
can_Fname varchar(200) null, -- real type and len here
can_HPhone varchar(200) null, -- real type and len here
can_Email varchar(200) null, -- real type and len here
)

-- just copy the rows where a duplicate fname exits 
-- (better performance specially for a big table)
insert into #tmp 
select can_fname,can_hphone,can_email
from Can 
where can_fname exists in (select can_fname from Can 
group by can_fname having count(*)>1)

-- select the rows that have the same fname and 
-- at least the same phone or email
select can_Fname, can_Hphone, can_Email  
from #tmp a where exists
(select * from #tmp b where
a.ID<>b.ID and A.can_fname = b.can_fname
and (isnull(a.can_HPhone,'')=isnull(b.can_HPhone,'')
or  (isnull(a.can_email,'')=isnull(b.can_email,'') )
tekBlues
Sometimes temp tables are the best, or even (rarely) only, way to solve a problem....
RolandTumble
A: 

Try this:

SELECT Can_FName, COUNT(*)
FROM (
SELECT 
rank() over(partition by Can_FName order by  Can_FName,Can_HPhone) rnk_p,
rank() over(partition by Can_FName order by  Can_FName,Can_EMail) rnk_m,
Can_FName
FROM Can
) X
WHERE rnk_p=1 or rnk_m =1
GROUP BY Can_FName
HAVING COUNT(*)>1
msi77
A: 

None of these answers is correct. Quassnoi's is a decent approach, but you will notice one fatal flaw in the expressions "qo.id > dup.id" and "di.chainid < do.chainid": comparisons made by ID! This is ALWAYS bad practice because it depends on some inherent ordering in the IDs. IDs should NEVER be given any implicit meaning and should ONLY participate in equality or null testing. You can easily break Quassnoi's solution in this example by simply reordering the IDs in the data.

The essential problem is a disjunctive condition with a grouping, which leads to the possibility of two records being related through an intermediate, though they are not directly relatable.

e.g., you stated these records should all be grouped:

(1) John 555-00-00 [email protected]

(2) John 555-00-01 [email protected]

(3) John 555-00-01 [email protected]

You can see that #1 and #2 are relatable, as are #2 and #3, but clearly #1 and #3 are not directly relatable as a group.

This establishes that a recursive or iterative solution is the ONLY possible solution.

So, recursion is not viable since you can easily end up in a looping situation. This is what Quassnoi was trying to avoid with his ID comparisons, but in doing so he broke the algorithm. You could try limiting the levels of recursion, but you may not then complete all relations, and you will still potentially be following loops back upon yourself, leading to excessive data size and prohibitive inefficiency.

The best solution is ITERATIVE: Start a result set by tagging each ID as a unique group ID, and then spin through the result set and update it, combining IDs into the same unique group ID as they match on the disjunctive condition. Repeat the process on the updated set each time until no further updates can be made.

I will create example code for this soon.

Michael Petter
@Michael: within the scope of a query, the ids are stable. If they are not comparable or don't exists, you can replace them with a generated `ROW_NUMBER`.
Quassnoi
Basically what I ended up doing. I grouped on either aspect - then did a full comparison via DB Function.
Scott04073