views:

200

answers:

2

How to compare groups of tuples in sql: consider the following example:

TABLE T1
--------
GROUP     VALUE
-----     -----
A         FOO
A         BAR
X         HHH
X         ZOO

TABLE T2
--------
GROUP     VALUE
-----     -----
B         ZOO
C         FOO
C         BAR

I want to write an sql query which compares the groups of values in both tables and reports the differences. In the illustrated example, the group in table a: ((A,FOO),(A,BAR)) is the same as the group ((C,FOO),(C,BAR)) even though the group name is different. What counts is that the contents of the group are the same. Finally the query would report that there is a difference: it is the (B,ZOO) tuple.

RESULT
------
GROUP     VALUE
-----     -----
B         ZOO
X         HHH
X         ZOO

Although the group X containing ZOO in T1 has a matching value in T2: (B,ZOO) it is still not a match because the group also has the (X, HHH) value which is not part of the (B, ZOO) group in T2

A: 

Difference between T1 and T2 (two tables) could be this:

SELECT
   T1.GROUPNAME,
   T1.VALUE
FROM 
   T1
LEFT JOIN T2
ON T2.Value = T1.Value
WHERE T2.GROUPNAME IS NULL

For instance T1 has:

Foo 100 Bar 200 ZZZ 333

And T2 includes: Foo 100 Bar 200

The result of that query is ZZZ 333 it is the only record that does not match in both tables. You could even change T2's group name to say:

XYZ 100 ZXZ 200

And the result is still ZZZ 333. This is per what you are asking, if you want the opposite you can either UNION to it, or use RIGHT join.

Jon

JonH
Actually I care that the values in T2 are parts of the same group.
lewap
I dont understand what you are asking. Again provide DDL for all your tables / schema. Provide good sample data and expected results. What do you mean you care that the values in T2 are parts of the same group? Someone in the comments asked you about group names and you said it doesn't matter abotu the group name just that their values match in both tables. My query handles that.
JonH
+1  A: 

Something like this

create table t1 (group_id varchar2(20), value varchar2(20));
create table t2 (group_id varchar2(20), value varchar2(20));

insert into t1 values ('A','FOO');
insert into t1 values ('A','BAR');
insert into t1 values ('X','HHH');
insert into t1 values ('X','ZOO');
insert into t2 values ('C','FOO');
insert into t2 values ('C','BAR');
insert into t2 values ('B','ZOO');


select t1.group_id t1_group,t2.group_id t2_group, 
      --t1.all_val, t2.all_val, 
       case when t1.all_val = t2.all_val then 'match' else 'no match' end coll_match
from 
  (select 'T1' tab_id, group_id, collect(value) all_val, 
          min(value) min_val, max(value) max_val, count(distinct value) cnt_val 
  from t1 group by group_id) t1
full outer join
  (select 'T2' tab_id, group_id, collect(value) all_val, 
          min(value) min_val, max(value) max_val, count(distinct value) cnt_val 
  from t2 group by group_id) t2
on t1.min_val = t2.min_val and t1.max_val = t2.max_val and t1.cnt_val = t2.cnt_val
/

I've done a preliminary elimination based on the minmium, maximum and number of distinct values in each group, which would help with large datasets. If the datasets are small enough, you may not need them.

That tells you the matches. You just need to push it out an extra step to find the groups that don't have any matches

select t1_group
from
(
  select t1.group_id t1_group,t2.group_id t2_group, 
        --t1.all_val, t2.all_val, 
         case when t1.all_val = t2.all_val then 'match' end coll_match
  from 
    (select 'T1' tab_id, group_id, collect(value) all_val
    from t1 group by group_id) t1
  cross join
    (select 'T2' tab_id, group_id, collect(value) all_val
    from t2 group by group_id) t2
)
group by t1_group
having min(coll_match) is null
/

select t2_group
from
(
  select t1.group_id t1_group,t2.group_id t2_group, 
        --t1.all_val, t2.all_val, 
         case when t1.all_val = t2.all_val then 'match' end coll_match
  from 
    (select 'T1' tab_id, group_id, collect(value) all_val
    from t1 group by group_id) t1
  cross join
    (select 'T2' tab_id, group_id, collect(value) all_val
    from t2 group by group_id) t2
)
group by t2_group
having min(coll_match) is null
/
Gary
I don't see exactly, what do you use the min / max / count values for?
lewap
If you have a large data set (tens/hundreds of thousands of rows) comparing every group in table1 to every group in table2 would be a big task. The min/max/count would cut that down to a more manageable level by quickly excluding groups with different numbers of members or with different minimum/maximum values.
Gary
What if the min / max / count values are the same for a group, but the groups differ?
lewap
I had to do some changes, i.e. I don't use the min/max/count but the contents of the all_vall, but besides: you're the best! Thanks a lot, your answer was very useful!
lewap
I'll wait until I'll be able to offer you 100 points as a bounty.
lewap