views:

208

answers:

2

I'm working on a problem in Oracle that I'm struggling to solve 'elegantly'.

I have a data extract with three different identifiers: A, B, C

Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).

I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.

Extract table showing what the eventual groups should be:

Rownum | A    | B    | C    | End group
1        p      NULL   NULL   1
2        p      r      NULL   1
3        q      NULL   NULL   2
4        NULL   r      NULL   1
5        NULL   NULL   s      2
6        q      NULL   s      2

My original approach was to assign a guid to each row in the extract and create a lookup table for the three identifiers:

GUID | IDENTIFIER | IDENTIFIER TYPE | GROUP | END GROUP
1      p            A                 1       1
2      p            A                 1       1
2      r            B                 2       1
3      q            A                 3       3
4      r            B                 2       1
5      s            C                 4       3
6      q            A                 3       3
6      s            C                 4       3

Then group by identifier and assign a group number. The groups, however, need to be combined where possible to provide the view shown in end group.

The only solution I can think of for this problem is to use loops, which I'd rather avoid.

Any ideas would be greatly appreciated.

Niall

+2  A: 

Use merge instead of loop:

Table a(a,b,c,groupId)

Statement:

   merge into a
   USING (SELECT RANK() OVER(ORDER BY a,b,c) g, ROWID rid FROM a) SOURCE
   ON (a.ROWID = SOURCE.rid)
   WHEN MATCHED THEN UPDATE SET a.GroupId = SOURCE.g

It is same as:

    BEGIN
        FOR x IN ( SELECT RANK() OVER(ORDER BY a,b,c) g, ROWID rid FROM a)
        LOOP
             UPDATE a
                SET GroupId  = x.g
             WHERE a.RowId = x.rid;
        END LOOP;
    END;
Michael Pakhantsov
Also I would choose to use DENSE_RANK rather than RANK for this scenario.
APC
@APC, no need use NVL - its pointless, grouping will be done in same way.
Michael Pakhantsov
Using this solution, with or without these modifications, seems to produce a unique group_id for every row which obviously means that records are not being grouped. Am I missing something here? I'm a n00b regarding the merge command.
niallsco
+1 . Actually you are correct. It was the use of RANK() rather than DENSE_RANK() which threw out my test. So this solution will provide the desired outcome.
APC
@niallsco - I have tested Michael's solution and it works. So, yes, you are missing something.
APC
@niallsco You asked provide one statement. MERGE is good for this purpose. RANK() and DENSE_RANK() create same groupId for same combination of a,b,c
Michael Pakhantsov
This is correct, but the complexity of this problem is that two records which should have the same groupId do not necessarily have the same combination of a,b,c. They have at least one of a, b, c in common. Sometimes a, sometimes b, sometimes c, sometimes a combination of a, b and c.
niallsco
@niallsco Can you describe exactly rules for the grouping. From your example in the question it does not clear. Just add some more details. :) Re-read your requirement. Can you confirm, that row have same groupid if at least one of identitier value is equal?
Michael Pakhantsov
@Michael Thanks for your help, that's exactly correct. I've added an extra column to the original example to show what the end groupings would be.
niallsco
@niallsco Found another problem with grouping rules definitions. If we have have row1(a=a, b=null), row2(a=a, b=b), row3(a=k,b=b). So row2 can be in two groups. May you clarify this?
Michael Pakhantsov
@Michael in this case, all three records would be in the same group as between them they share a common identifier, 1 and 2 share a and 2 and 3 share b. The fact that one of the 1 and 3 have different values for a is insignificant.
niallsco
This problem can not be solved without looping. Tried do something with hierarchical queries - but without any luck. Because groupId depend on other rows and their connections with other rows it is impossible do in one step. Also any inserting new row will required determine new group Id, because new row can 'joink' several existing groups into one
Michael Pakhantsov
+2  A: 

Hi niallsco,

This is truly an interesting problem. Still, I think we are missing a definition of a "group". Since in your example (p,null,null) (row1) and (null,r,null) (row4) share no common identifier and belong to the same group I'll go with this definition for grouping:

A row belongs to a group if it shares at least one identifier with at least one row of this group.

This means we can "chain" rows. This naturally leads to a hierarchical solution:

SQL> SELECT ID, a, b, c, MIN(grp) grp
  2    FROM (SELECT connect_by_root(id) ID,
  3                 connect_by_root(a) a,
  4                 connect_by_root(b) b,
  5                 connect_by_root(c) c,
  6                 ID grp
  7             FROM a
  8           CONNECT BY NOCYCLE(PRIOR a = a
  9                           OR PRIOR b = b
 10                           OR PRIOR c = c))
 11   GROUP BY ID, a, b, c
 12   ORDER BY ID;

        ID A          B          C                 GRP
---------- ---------- ---------- ---------- ----------
         1 p                                         1
         2 p          r                              1
         3 q                                         3
         4            r                              1
         5                       s                   3
         6 q                     s                   3

6 rows selected

You can execute the subquery to understand the construction:

SQL> SELECT connect_by_root(id) ID,
  2         connect_by_root(a) a,
  3         connect_by_root(b) b,
  4         connect_by_root(c) c,
  5         substr(sys_connect_by_path(ID, '->'), 3) path,
  6         ID grp
  7    FROM a
  8  CONNECT BY NOCYCLE(a = PRIOR a
  9                  OR b = PRIOR b
 10                  OR c = PRIOR c);

        ID A          B          C          PATH            GRP
---------- ---------- ---------- ---------- -------- ----------
         1 p                                1                 1
         1 p                                1->2              2
         1 p                                1->2->4           4
         2 p          r                     2                 2
         2 p          r                     2->1              1
         2 p          r                     2->4              4
         3 q                                3                 3
         3 q                                3->6              6
         3 q                                3->6->5           5
         4            r                     4                 4
         4            r                     4->2              2
         4            r                     4->2->1           1
         5                       s          5                 5
         5                       s          5->6              6
         5                       s          5->6->3           3
         6 q                     s          6                 6
         6 q                     s          6->3              3
         6 q                     s          6->5              5

18 rows selected
Vincent Malgrat
@Vincent +1. Extremely elegant solution, I tried use "connect by", but without luck.
Michael Pakhantsov
+++ very nicely done.
Jeffrey Kemp