views:

195

answers:

2

I am creating a script that for "merging" and deleting duplicate rows from a table. The table contains address information, and uses an integer field for storing information about the email as bit flags (column name lngValue). For example, lngValue & 1 == 1 means its the primary address.

There are instances of the same email being entered twice, but sometimes with different lngValues. To resolve this, I need to take the lngValue from all duplicates and assign them to one surviving record and delete the rest.

My biggest headache so far as been with the "merging" of the records. What I want to do is bitwise or all lngValues of duplicate records together. Here is what I have so far, which only finds the value of all lngValues bitwise or'ed together.

Warning: messy code ahead

declare @duplicates table
(
lngInternetPK int,
lngContactFK int,
lngValue int
)

insert into @duplicates (lngInternetPK, lngContactFK, lngValue) 
(
select  tblminternet.lngInternetPK, tblminternet.lngContactFK, tblminternet.lngValue   from tblminternet  inner join 
(select strAddress, lngcontactfk, count(*) as count from tblminternet where lngValue & 256 <> 256 group by strAddress, lngcontactfk) secondemail
On tblminternet.strAddress = secondemail.strAddress and
tblminternet.lngcontactfk = secondemail.lngcontactfk 
where count > 1 and tblminternet.strAddress is not null and tblminternet.lngValue & 256 <> 256 --order by lngContactFK, strAddress
)

update @duplicates set lngValue = t.val

from 
    (select (sum(dupes.lngValue) & 65535) as val from 
     (select  here.lngInternetPK,                     here.lngContactFK, here.lngValue from tblminternet here  inner join 
     (select strAddress, lngcontactfk, count(*) as count from tblminternet where lngValue & 256 <> 256 group by strAddress, lngcontactfk) secondemail
     On here.strAddress = secondemail.strAddress     and
     here.lngcontactfk = secondemail.lngcontactfk 
     where count > 1 and here.strAddress is not      null and here.lngValue & 256 <> 256) dupes, tblminternet this

    where this.lngContactFK = dupes.lngContactFK
    ) t
where lngInternetPK in (select lngInternetPK from @duplicates)

Edit:
As requested here is some sample data:

Table Name: tblminternet
Column Names:
lngInternetPK
lngContactFK
lngValue
strAddress

Example row 1:
lngInternetPK: 1
lngContactFK: 1
lngValue: 33
strAddress: "[email protected]"

Example row 2:
lngInternetPK: 2
lngContactFK: 1
lngValue: 40
strAddress: "[email protected]"

If these two were merged here is the desired result:
lngInternetPK: 1
lngContactFK: 1
lngValue: 41
strAddress: "[email protected]"

Other necessary rules:
Each contact can have multiple emails, but each email row must be distinct ( each email can only appear as one row).

A: 

I believe the following query gets you what you want. This routine assumes a max of two duplicate addresses per contact. If there's more than one dup per contact, the query will have to be modified. I hope this helps.

Declare @tblminternet 
Table 
( lngInternetPK int,   
  lngContactFK int,  
  lngValue int, 
  strAddress varchar(255)
)

Insert Into @tblminternet 
select 1, 1, 33, '[email protected]' 
union
select 2, 1, 40, '[email protected]'
union 
select 3, 2, 33, '[email protected]'
union 
select 4, 2, 40, '[email protected]'
union 
select 5, 3, 2, '[email protected]'

--Select * from @tblminternet

Select  Distinct   
    A.lngContactFK , 
    A.lngValue | B.lngValue as 'Bitwise OR', 
    A.strAddress
From @tblminternet A, @tblminternet B
Where A.lngContactFK = B.lngContactFK
And A.strAddress = B.strAddress
And A.lngInternetPK != B.lngInternetPK
Ben Griswold
This gives me a good idea of where to start, but the problem is I am dealing with a few hundred. The temp table I made is a table of all duplicates existing in tblminternet.
Corey Sunwold
+1  A: 

SQL Server lacks native bitwise aggregates, that's why we need to emulate them.

The main idea here is to generate a set of bits from 0 to 15, for each bit apply the bitmask to the value and select MAX (which will give us an OR for a given bit), then select the SUM (which will merge the bit masks).

The we just update the first lngInternetPK for any given (lngContactFK, strValue) with the new value of lngValue, and delete all duplicates.

;WITH   bits AS
        (
        SELECT  0 AS b
        UNION ALL
        SELECT  b + 1
        FROM    bits
        WHERE   b < 15
        ),
        v AS
        (
        SELECT  i.*,
                (
                SELECT  SUM(value)
                FROM    (
                        SELECT  MAX(lngValue & POWER(2, b)) AS value
                        FROM    tblmInternet ii
                        CROSS JOIN
                                bits
                        WHERE   ii.lngContactFK = i.lngContactFK
                                AND ii.strAddress = i.strAddress
                        GROUP BY
                                b
                        ) q
                ) AS lngNewValue
        FROM    (
                SELECT  ii.*, ROW_NUMBER() OVER (PARTITION BY lngContactFK, strAddress ORDER BY lngInternetPK) AS rn
                FROM    tblmInternet ii
                ) i
        WHERE   rn = 1
        )
UPDATE  v
SET     lngValue = lngNewValue;

;WITH    v AS
        (
        SELECT  ii.*, ROW_NUMBER() OVER (PARTITION BY lngContactFK, strAddress ORDER BY lngInternetPK) AS rn
        FROM    tblmInternet ii
        )
DELETE  v
WHERE   rn > 1

See this article in my blog for more detailed explanations:

Quassnoi
This is good. The only problem is that each contact can have multiple distinct emails. So we don't want to delete all other emails associated with an contact.
Corey Sunwold
@csunwold: so you want to keep one instance of each `(contact, email)` pair?
Quassnoi
@Quassnoi: correct. There should be one instance of an email for every contact, but multiple distinct emails are allowed for each contact.
Corey Sunwold
@Quassnoi: I also get the error: "Incorrect syntax near the keyword 'as'." on the first line.
Corey Sunwold
@csunwold: see post update
Quassnoi
@Quassnoi: I am going to do some more testing tomorrow to verify all the data is as expected but it appears to work. Thank you. Can you explain why adding the semi colon in front of the with statements made a difference?
Corey Sunwold
@csunwold: Originally, Transact-SQL had no statement separators, i. e. `SELECT 1 SELECT 2` is a valid statement block with `2` statements in `Transact-SQL` but not `ANSI`. Later `Microsoft` decided to force statement separators to be `ANSI` conformant. Forcing them for legacy statements like the one above would break the backward compatibility, but turned out to be OK for the new statements like `WITH` (which was introduced in `SQL Server 2005`).
Quassnoi
@Quassnoi: Awesome work. Thank you I learned a lot.
Corey Sunwold