ansaurus

Question

SQL combine multiple identifiers to create a group id for duplicate records

Answer 1

+2 A:

Use MERGE instead of a loop:

Table a(a,b,c,groupId)

Statement:

   MERGE INTO a
   USING (SELECT RANK() OVER (ORDER BY a, b, c) g, ROWID rid FROM a) source
   ON (a.ROWID = source.rid)
   WHEN MATCHED THEN UPDATE SET a.GroupId = source.g;

It is the same as:

    BEGIN
        FOR x IN ( SELECT RANK() OVER(ORDER BY a,b,c) g, ROWID rid FROM a)
        LOOP
             UPDATE a
                SET GroupId  = x.g
             WHERE a.RowId = x.rid;
        END LOOP;
    END;

Michael Pakhantsov 2010-07-28 10:03:42

Also I would choose to use DENSE_RANK rather than RANK for this scenario.

APC 2010-07-28 11:29:15
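
For readers weighing the two functions, here is a minimal sketch of how RANK and DENSE_RANK differ on tied rows. The inline sample_rows data and the rank_id / dense_rank_id aliases are made up for illustration and do not come from the question.

```sql
-- Hypothetical sample rows built inline from DUAL; no real table is touched.
WITH sample_rows AS (
  SELECT 'p' a, 'r' b, CAST(NULL AS VARCHAR2(1)) c FROM dual UNION ALL
  SELECT 'p',   'r',   NULL                        FROM dual UNION ALL
  SELECT 'q',   NULL,  's'                         FROM dual
)
SELECT a, b, c,
       RANK()       OVER (ORDER BY a, b, c) AS rank_id,       -- leaves gaps after ties
       DENSE_RANK() OVER (ORDER BY a, b, c) AS dense_rank_id  -- no gaps after ties
  FROM sample_rows;
```

Both functions give the two identical (p, r, null) rows the same number, 1. They differ only on the next distinct combination: RANK assigns 3, DENSE_RANK assigns 2. Either would work as a group id for exact duplicates; DENSE_RANK simply keeps the ids consecutive.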

@APC, there is no need to use NVL. It's pointless, because the grouping will be done in the same way.

Michael Pakhantsov 2010-07-28 12:04:27

Using this solution, with or without these modifications, seems to produce a unique group_id for every row which obviously means that records are not being grouped. Am I missing something here? I'm a n00b regarding the merge command.

niallsco 2010-07-28 12:09:46

+1. Actually, you are correct. It was the use of RANK() rather than DENSE_RANK() which threw out my test. So this solution will provide the desired outcome.

APC 2010-07-28 12:11:42

@niallsco - I have tested Michael's solution and it works. So, yes, you are missing something.

APC 2010-07-28 12:13:32

@niallsco You asked for a single statement, and MERGE is good for this purpose. RANK() and DENSE_RANK() both create the same groupId for the same combination of a, b, c.

Michael Pakhantsov 2010-07-28 15:45:43

This is correct, but the complexity of this problem is that two records which should have the same groupId do not necessarily have the same combination of a,b,c. They have at least one of a, b, c in common. Sometimes a, sometimes b, sometimes c, sometimes a combination of a, b and c.

niallsco 2010-07-28 15:57:45

@niallsco Can you describe the exact rules for the grouping? They are not clear from the example in your question. Just add some more details. :) I re-read your requirement. Can you confirm that rows have the same groupId if at least one of their identifier values is equal?

Michael Pakhantsov 2010-07-28 16:07:15

@Michael Thanks for your help, that's exactly correct. I've added an extra column to the original example to show what the end groupings would be.

niallsco 2010-07-28 16:13:45

@niallsco I found another problem with the definition of the grouping rules. Suppose we have row1(a=a, b=null), row2(a=a, b=b), row3(a=k, b=b). Then row2 could belong to two groups. Can you clarify this?

Michael Pakhantsov 2010-07-29 08:48:54

@Michael In this case, all three records would be in the same group, because between them they share a common identifier: rows 1 and 2 share a, and rows 2 and 3 share b. The fact that rows 1 and 3 have different values for a is insignificant.

niallsco 2010-07-29 09:06:59
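
The rule confirmed in this comment can be turned into a sanity check for any candidate solution. The query below is only a sketch: it assumes table a has a unique id column as well as a populated groupId column, a combination that the thread never shows in a single table definition.

```sql
-- Assumption: table a has a unique id column and a populated groupId column.
-- Lists every pair of rows that share at least one identifier but carry
-- different group ids. NULL identifiers never match, since NULL = NULL is
-- not true in SQL.
SELECT x.id AS id_1, y.id AS id_2
  FROM a x
  JOIN a y
    ON (x.a = y.a OR x.b = y.b OR x.c = y.c)
 WHERE x.id < y.id                -- report each pair once
   AND x.groupId <> y.groupId;
```

An empty result means every directly linked pair agrees on its group, and therefore every chain of linked rows does too. It does not prove that unrelated rows received different group ids; that would need a separate check.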

This problem cannot be solved without looping. I tried to do something with hierarchical queries, but without any luck. Because a row's groupId depends on other rows and on their connections with still other rows, it is impossible to compute in one step. Also, inserting any new row will require determining group ids again, because a new row can join several existing groups into one.

Michael Pakhantsov 2010-07-29 15:26:08
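
For comparison, the looping approach mentioned in this comment might look like the PL/SQL sketch below. It is not code from the thread. It assumes table a has a unique numeric id column next to groupId, and it repeatedly pulls each row's groupId down to the smallest groupId among the rows it shares an identifier with, until a pass changes nothing.

```sql
BEGIN
  -- Assumption: id is a unique numeric column. Start with one group per row.
  UPDATE a SET groupId = id;

  LOOP
    -- Each pass moves group ids one "hop" along shared identifiers.
    UPDATE a t
       SET t.groupId = (SELECT MIN(n.groupId)
                          FROM a n
                         WHERE n.a = t.a OR n.b = t.b OR n.c = t.c)
     WHERE t.groupId > (SELECT MIN(n.groupId)
                          FROM a n
                         WHERE n.a = t.a OR n.b = t.b OR n.c = t.c);

    -- Stop once a full pass leaves every row unchanged.
    EXIT WHEN SQL%ROWCOUNT = 0;
  END LOOP;
END;
/
```

Every group ends up labelled with the smallest id it contains. Rows whose identifiers are all NULL match nothing and keep their own id. The number of passes grows with the length of the longest chain of linked rows, so this trades the single statement for predictable, repeated set-based updates.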

Answer 2

+2 A:

Hi niallsco,

This is truly an interesting problem. Still, I think we are missing a definition of a "group". Since in your example (p,null,null) (row1) and (null,r,null) (row4) share no common identifier yet belong to the same group, I'll go with this definition for grouping:

A row belongs to a group if it shares at least one identifier with at least one row of this group.

This means we can "chain" rows. This naturally leads to a hierarchical solution:

SELECT ID, a, b, c, MIN(grp) grp
  FROM (SELECT connect_by_root(id) ID,
               connect_by_root(a) a,
               connect_by_root(b) b,
               connect_by_root(c) c,
               ID grp
          FROM a
       CONNECT BY NOCYCLE(PRIOR a = a
                       OR PRIOR b = b
                       OR PRIOR c = c))
 GROUP BY ID, a, b, c
 ORDER BY ID;

        ID A          B          C                 GRP
---------- ---------- ---------- ---------- ----------
         1 p                                         1
         2 p          r                              1
         3 q                                         3
         4            r                              1
         5                       s                   3
         6 q                     s                   3

6 rows selected

You can execute the subquery to understand the construction:

SELECT connect_by_root(id) ID,
       connect_by_root(a) a,
       connect_by_root(b) b,
       connect_by_root(c) c,
       substr(sys_connect_by_path(ID, '->'), 3) path,
       ID grp
  FROM a
CONNECT BY NOCYCLE(a = PRIOR a
                OR b = PRIOR b
                OR c = PRIOR c);

        ID A          B          C          PATH            GRP
---------- ---------- ---------- ---------- -------- ----------
         1 p                                1                 1
         1 p                                1->2              2
         1 p                                1->2->4           4
         2 p          r                     2                 2
         2 p          r                     2->1              1
         2 p          r                     2->4              4
         3 q                                3                 3
         3 q                                3->6              6
         3 q                                3->6->5           5
         4            r                     4                 4
         4            r                     4->2              2
         4            r                     4->2->1           1
         5                       s          5                 5
         5                       s          5->6              6
         5                       s          5->6->3           3
         6 q                     s          6                 6
         6 q                     s          6->3              3
         6 q                     s          6->5              5

18 rows selected

Vincent Malgrat 2010-08-04 13:12:17
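
The answer above only selects the computed group. To store it, the same hierarchical query could feed a MERGE, roughly as sketched below. This is an assumption-laden sketch rather than part of the answer: it supposes table a has both the unique id column used in the query and the groupId column from the first answer, and the aliases t and s are arbitrary.

```sql
-- Assumption: table a has a unique id column and a groupId column to fill.
MERGE INTO a t
USING (SELECT id, MIN(grp) grp
         FROM (SELECT CONNECT_BY_ROOT(id) id,   -- the row being labelled
                      id grp                    -- every row reachable from it
                 FROM a
              CONNECT BY NOCYCLE (PRIOR a = a
                               OR PRIOR b = b
                               OR PRIOR c = c))
        GROUP BY id) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET t.groupId = s.grp;
```

Because there is no START WITH clause, every row acts as a root and the query walks every path it can reach, so the intermediate result can grow quickly on large or densely linked tables. It is worth trying on a small sample first.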

@Vincent +1. Extremely elegant solution. I tried to use "connect by", but without luck.

Michael Pakhantsov 2010-08-04 13:45:52

+++ very nicely done.

Jeffrey Kemp 2010-08-10 03:39:06
