ansaurus

Question

How to find duplicate values in SQL Server

Answer 1

+1 A:

The easiest would probably be to write a stored procedure to iterate over each group of customers with duplicates and insert the matching ones per group number respectively.

However, I've thought about it and you can probably do this with a subquery. Hopefully I haven't made it more complicated than it ought to, but this should get you what you're looking for for the first table of duplicates (all four fields). Note that this is untested, so it might need a little tweaking.

Basically, it gets each group of fields where there are duplicates, a group number for each, then gets all customers with those fields and assigns the same group number.

INSERT INTO FourFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY c.field1) AS group_no,
             c.field1, c.field2, c.field3, c.field4
      FROM Customers c
      GROUP BY c.field1, c.field2, c.field3, c.field4
      HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON custs.field1 = Groups.field1
                           AND custs.field2 = Groups.field2
                           AND custs.field3 = Groups.field3
                           AND custs.field4 = Groups.field4

The other ones are a bit more complicated, however as you'll need to expand out the possibilities. The three-field groups would then be:

INSERT INTO ThreeFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY GroupsInner.field1) AS group_no,
             GroupsInner.field1, GroupsInner.field2, 
             GroupsInner.field3, GroupsInner.field4
      FROM (SELECT c.field1, c.field2, c.field3, NULL AS field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                       FROM FourFieldsDuplicates d
                       WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field2, c.field3
            UNION ALL
            SELECT c.field1, c.field2, NULL AS field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field2, c.field4
            UNION ALL
            SELECT c.field1, NULL AS field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field1, c.field3, c.field4
            UNION ALL
            SELECT NULL AS field1, c.field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_no)
            GROUP BY c.field2, c.field3, c.field4) GroupsInner
      GROUP BY GroupsInner.field1, GroupsInner.field2, 
               GroupsInner.field3, GroupsInner.field4
      HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON (Groups.field1 IS NULL OR custs.field1 = Groups.field1)
                           AND (Groups.field2 IS NULL OR custs.field2 = Groups.field2)
                           AND (Groups.field3 IS NULL OR custs.field3 = Groups.field3)
                           AND (Groups.field4 IS NULL OR custs.field4 = Groups.field4)

Hopefully this produces the right results and I'll leave the last one as an exercise. :-D

lc 2010-05-20 10:05:39

@Ic Is it right to write "c.field1 as group_no" ? group_no is int and field1 is varchar.Maybe I should use some temp tables?

hgulyan 2010-05-20 10:18:57

@hgulyan It's actually ROW_NUMBER() as group_no.

lc 2010-05-20 10:25:48

@Ic, Yeah, I already made change to that. Still trying to run first script...

hgulyan 2010-05-20 10:27:49

Actually, it doesn't work. GroupId is unique for every row, because it just orders by field1 and I think it works before group by, that's why it just adds row_number to all rows, and I want to have groups with same ids for duplicates.

hgulyan 2010-05-20 10:34:10

It seems that first query worked:)

hgulyan 2010-05-20 11:19:55

Can there be any problem if one of fields is smalldatetime?

hgulyan 2010-05-20 11:33:09

@hgulyan If ROW_NUMBER() isn't working right, try wrapping it in a new subselect (`SELECT ROW_NUMBER(), ... FROM (SELECT ... GROUP BY ...)`). As the smalldatetime goes, I don't think it should matter as long as `=` works properly for them.

lc 2010-05-20 11:53:29

I'll mark this as the answer, because groupId works fine, but the query became really hard. I don't know, is there any possibility to get it right. Thank you anyway.

hgulyan 2010-05-21 09:57:32

Answer 2

+1 A:

I'm not sure if you require an equality check on different fields (like field1=field2).
Otherwise this might be enough.

Edit

Feel free to adjust the testdata to provide us with inputs that give a wrong output according to your specifications.

Test data

DECLARE @Customers TABLE (
  customer_number INTEGER IDENTITY(1, 1)
  , field1 INTEGER
  , field2 INTEGER
  , field3 INTEGER
  , field4 INTEGER)

INSERT INTO @Customers
          SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, NULL
UNION ALL SELECT 1, 1, 1, 2
UNION ALL SELECT 1, 1, 1, 3
UNION ALL SELECT 2, 1, 1, 1

All Equal

SELECT  ROW_NUMBER() OVER (ORDER BY c1.customer_number)
        , c1.field1
        , c1.field2
        , c1.field3
        , c1.field4
FROM    @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                    AND ISNULL(c2.field1, 0) = ISNULL(c1.field1, 0) 
                                    AND ISNULL(c2.field2, 0) = ISNULL(c1.field2, 0)
                                    AND ISNULL(c2.field3, 0) = ISNULL(c1.field3, 0)
                                    AND ISNULL(c2.field4, 0) = ISNULL(c1.field4, 0)

One field different

SELECT  ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4)
        , field1
        , field2
        , field3
        , field4
FROM    (
          SELECT  DISTINCT c1.field1
                  , c1.field2
                  , c1.field3
                  , field4 = NULL
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND c2.field2 = c1.field2 
                                             AND c2.field3 = c1.field3 
                                             AND ISNULL(c2.field4, 0) <> ISNULL(c1.field4, 0) 
          UNION ALL
          SELECT  DISTINCT c1.field1
                  , c1.field2
                  , NULL
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND c2.field2 = c1.field2 
                                             AND ISNULL(c2.field3, 0) <> ISNULL(c1.field3, 0) 
                                             AND c2.field4 = c1.field4 
          UNION ALL
          SELECT  DISTINCT c1.field1
                  , NULL
                  , c1.field3
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND c2.field1 = c1.field1 
                                             AND ISNULL(c2.field2, 0) <> ISNULL(c1.field2, 0) 
                                             AND c2.field3 = c1.field3 
                                             AND c2.field4 = c1.field4 
          UNION ALL
          SELECT  DISTINCT NULL
                  , c1.field2
                  , c1.field3
                  , c1.field4
          FROM    @Customers c1 
                  INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number  
                                             AND ISNULL(c2.field1, 0) <> ISNULL(c1.field1, 0)
                                             AND c2.field2 = c1.field2 
                                             AND c2.field3 = c1.field3 
                                             AND c2.field4 = c1.field4 
      ) c

Lieven 2010-05-20 10:46:10

Will INNER JOIN work if there're some null values?

hgulyan 2010-05-20 11:19:08

This is good and actually what I had at first, but the problem is then getting back out the groups to insert them into the new table...

lc 2010-05-20 11:51:36

Isn't there any way to add rownumber to this query?

hgulyan 2010-05-20 12:02:44

@hgulyan - what would you expect if both compared fields are NULL? A match or not?

Lieven 2010-05-20 12:27:01

match. If one of fields is not filled, it means that value for two rows is same.

hgulyan 2010-05-20 12:31:24

@Lieven, rownumber was needed for grouping as I wrote in my question.

hgulyan 2010-05-20 12:58:07

@hgulyan - I have updated the answer.

Lieven 2010-05-20 13:06:53

Thank you for your answer. GroupId doesn't work, but the query is great and useful.

hgulyan 2010-05-21 09:58:09

ansaurus

tags:

views:

answers:

How to find duplicate values in SQL Server

related questions