views:

410

answers:

2

Hi,

I'm using SQL Server 2008. I have a table

Customers

customer_number int

field1 varchar

field2 varchar

field3 varchar

field4 varchar

... and a lot more columns, that don't matter for my queries.

Column *customer_number* is pk. I'm trying to find duplicate values and some differences between them.

Please, help me to find all rows that have same

1) field1, field2, field3, field4

2) only 3 columns are equal and one of them isn't (except rows from list 1)

3) only 2 columns equal and two of them aren't (except rows from list 1 and list 2)

In the end, I'll have 3 tables with this results and additional groupId, which will be same for a group of similar (For example, for 3 column equals, rows that have 3 same columns equal will be a separate group)

Thank you.

+1  A: 

The easiest would probably be to write a stored procedure to iterate over each group of customers with duplicates and insert the matching ones per group number respectively.

However, I've thought about it and you can probably do this with a subquery. Hopefully I haven't made it more complicated than it ought to, but this should get you what you're looking for for the first table of duplicates (all four fields). Note that this is untested, so it might need a little tweaking.

Basically, it gets each group of fields where there are duplicates, a group number for each, then gets all customers with those fields and assigns the same group number.

INSERT INTO FourFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY c.field1) AS group_no,
             c.field1, c.field2, c.field3, c.field4
      FROM Customers c
      GROUP BY c.field1, c.field2, c.field3, c.field4
      HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON custs.field1 = Groups.field1
                           AND custs.field2 = Groups.field2
                           AND custs.field3 = Groups.field3
                           AND custs.field4 = Groups.field4

The other ones are a bit more complicated, however as you'll need to expand out the possibilities. The three-field groups would then be:

INSERT INTO ThreeFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY GroupsInner.field1) AS group_no,
             GroupsInner.field1, GroupsInner.field2, 
             GroupsInner.field3, GroupsInner.field4
      FROM (SELECT c.field1, c.field2, c.field3, NULL AS field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                       FROM FourFieldsDuplicates d
                       WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field2, c.field3
            UNION ALL
            SELECT c.field1, c.field2, NULL AS field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field2, c.field4
            UNION ALL
            SELECT c.field1, NULL AS field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field3, c.field4
            UNION ALL
            SELECT NULL AS field1, c.field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field2, c.field3, c.field4) GroupsInner
      GROUP BY GroupsInner.field1, GroupsInner.field2, 
               GroupsInner.field3, GroupsInner.field4
      HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON (Groups.field1 IS NULL OR custs.field1 = Groups.field1)
                           AND (Groups.field2 IS NULL OR custs.field2 = Groups.field2)
                           AND (Groups.field3 IS NULL OR custs.field3 = Groups.field3)
                           AND (Groups.field4 IS NULL OR custs.field4 = Groups.field4)

Hopefully this produces the right results and I'll leave the last one as an exercise. :-D

lc
@Ic Is it right to write "c.field1 as group_no" ? group_no is int and field1 is varchar.Maybe I should use some temp tables?
hgulyan
@hgulyan It's actually ROW_NUMBER() as group_no.
lc
@Ic, Yeah, I already made change to that. Still trying to run first script...
hgulyan
Actually, it doesn't work. GroupId is unique for every row, because it just orders by field1 and I think it works before group by, that's why it just adds row_number to all rows, and I want to have groups with same ids for duplicates.
hgulyan
It seems that first query worked:)
hgulyan
Can there be any problem if one of fields is smalldatetime?
hgulyan
@hgulyan If ROW_NUMBER() isn't working right, try wrapping it in a new subselect (`SELECT ROW_NUMBER(), ... FROM (SELECT ... GROUP BY ...)`). As the smalldatetime goes, I don't think it should matter as long as `=` works properly for them.
lc
I'll mark this as the answer, because groupId works fine, but the query became really hard. I don't know, is there any possibility to get it right. Thank you anyway.
hgulyan
+1  A: 

I'm not sure if you require an equality check on different fields (like field1=field2).
Otherwise this might be enough.

Edit

Feel free to adjust the testdata to provide us with inputs that give a wrong output according to your specifications.

Test data

DECLARE @Customers TABLE (
  customer_number INTEGER IDENTITY(1, 1)
  , field1 INTEGER
  , field2 INTEGER
  , field3 INTEGER
  , field4 INTEGER)

INSERT INTO @Customers
          SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, NULL
UNION ALL SELECT 1, 1, 1, 2
UNION ALL SELECT 1, 1, 1, 3
UNION ALL SELECT 2, 1, 1, 1

All Equal

SELECT  ROW_NUMBER() OVER (ORDER BY c1.customer_number)
        , c1.field1
        , c1.field2
        , c1.field3
        , c1.field4
FROM    @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                    AND ISNULL(c2.field1, 0) = ISNULL(c1.field1, 0) 
                                    AND ISNULL(c2.field2, 0) = ISNULL(c1.field2, 0)
                                    AND ISNULL(c2.field3, 0) = ISNULL(c1.field3, 0)
                                    AND ISNULL(c2.field4, 0) = ISNULL(c1.field4, 0)

One field different

SELECT  ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4)
        , field1
        , field2
        , field3
        , field4
FROM    (
          SELECT  DISTINCT c1.field1
                  , c1.field2
                  , c1.field3
                  , field4 = NULL
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND c2.field2 = c1.field2 
                                             AND c2.field3 = c1.field3 
                                             AND ISNULL(c2.field4, 0) <> ISNULL(c1.field4, 0) 
          UNION ALL
          SELECT  DISTINCT c1.field1
                  , c1.field2
                  , NULL
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND c2.field2 = c1.field2 
                                             AND ISNULL(c2.field3, 0) <> ISNULL(c1.field3, 0) 
                                             AND c2.field4 = c1.field4 
          UNION ALL
          SELECT  DISTINCT c1.field1
                  , NULL
                  , c1.field3
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND ISNULL(c2.field2, 0) <> ISNULL(c1.field2, 0) 
                                             AND c2.field3 = c1.field3 
                                             AND c2.field4 = c1.field4 
          UNION ALL
          SELECT  DISTINCT NULL
                  , c1.field2
                  , c1.field3
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND ISNULL(c2.field1, 0) <> ISNULL(c1.field1, 0)
                                             AND c2.field2 = c1.field2 
                                             AND c2.field3 = c1.field3 
                                             AND c2.field4 = c1.field4 
      ) c
Lieven
Will INNER JOIN work if there're some null values?
hgulyan
This is good and actually what I had at first, but the problem is then getting back out the groups to insert them into the new table...
lc
Isn't there any way to add rownumber to this query?
hgulyan
@hgulyan - what would you expect if both compared fields are NULL? A match or not?
Lieven
match. If one of fields is not filled, it means that value for two rows is same.
hgulyan
@Lieven, rownumber was needed for grouping as I wrote in my question.
hgulyan
@hgulyan - I have updated the answer.
Lieven
Thank you for your answer. GroupId doesn't work, but the query is great and useful.
hgulyan