Here's the gist of the problem: Given a list of sets, such as:

[ (1,2,3), (5,2,6), (7,8,9), (6,12,13), (21,8,34), (19,20) ]

Return a list of groups of the sets, such that sets that have a shared element are in the same group.

[ [ (1,2,3), (5,2,6), (6,12,13) ], [ (7,8,9), (21,8,34) ], [ (19,20) ] ]

Note the stickiness: the set (6,12,13) doesn't share an element with (1,2,3), but the two end up in the same group because of (5,2,6).

To complicate matters, I should mention that I don't really have these neat sets, but rather a DB table with several million rows that looks like:

element | set_id
----------------
1       | 1
2       | 1
3       | 1
5       | 2
2       | 2
6       | 2

and so on. So I would love a way to do it in SQL, but I would be happy with a general direction for the solution.

EDIT: Changed the table column names to (element, set_id) instead of (key, group_id), to make the terms more consistent. Note that Kev's answer uses the old column names.

+1  A: 

You could think of it as a graph problem where the set (1,2,3) is connected to the set (5,2,6) via the 2, and then use a standard algorithm to find the connected sub-graphs.

Here's a quick Python implementation:

nodes = [ [1,2,3], [2,4,5], [6,7,8], [10,11,12], [7,10,13], [12], [] ]
links = [ set() for x in nodes ]

# First find the links: two sets are linked if they share an element.
for n in range(len(nodes)):
    for item in nodes[n]:
        for m in range(n + 1, len(nodes)):
            if item in nodes[m]:
                links[n].add(m)
                links[m].add(n)

# Flood the link graph to collect groups of connected sets.
sets = []
nodes_not_in_a_set = list(range(len(nodes)))

while len(nodes_not_in_a_set) > 0:
    nodes_to_explore = [nodes_not_in_a_set.pop()]
    current_set = set()
    while len(nodes_to_explore) > 0:
        current_node = nodes_to_explore.pop()
        current_set.add(current_node)
        if current_node in nodes_not_in_a_set:
            nodes_not_in_a_set.remove(current_node)
        for l in links[current_node]:
            if l not in current_set and l not in nodes_to_explore:
                nodes_to_explore.append(l)
    if len(current_set) > 0:
        sets.append(current_set)

for s in sets:
    print([nodes[n] for n in s])

output:

[[]]
[[6, 7, 8], [10, 11, 12], [7, 10, 13], [12]]
[[1, 2, 3], [2, 4, 5]]
Matt Price
Could you elaborate a little?
itsadok
This is tough to do without a lft/rgt (nested set) strategy (unless you want loops galore for the tree traversal)
Pittsburgh DBA
A: 

This is likely pretty inefficient, but it should work, at least: start with a key, select all the groups containing that key, then all the keys of those groups, then all the groups containing those keys, and so on. As soon as a step adds no new keys or groups, you have the list of all the groups in one sub-graph. Exclude those and repeat from the beginning until you have no data left.

In terms of SQL this would need a stored procedure, I think. WITH RECURSIVE might help you somehow, but I don't have any experience with it yet, and I'm not sure it's available on your DB backend.
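
For illustration, here is a rough Python sketch of that back-and-forth expansion (in memory rather than SQL), assuming the table has been read into two lookup dicts; the sample rows and names below are made up, using the question's old (key, group_id) column names:

from collections import defaultdict

rows = [(1, 1), (2, 1), (3, 1), (5, 2), (2, 2), (6, 2),
        (7, 3), (8, 3), (9, 3), (6, 4), (12, 4), (13, 4)]

groups_of_key = defaultdict(set)   # key -> group_ids containing it
keys_of_group = defaultdict(set)   # group_id -> keys it contains
for key, group_id in rows:
    groups_of_key[key].add(group_id)
    keys_of_group[group_id].add(key)

def component(start_key):
    """Alternate keys -> groups -> keys until a step adds nothing new."""
    keys, groups = {start_key}, set()
    while True:
        new_groups = {g for k in keys for g in groups_of_key[k]} - groups
        new_keys = {k for g in new_groups for k in keys_of_group[g]} - keys
        if not new_groups and not new_keys:
            return groups
        groups |= new_groups
        keys |= new_keys

print(component(1))   # e.g. {1, 2, 4}

Excluding those groups and starting again from a key that has not been seen yet gives the remaining sub-graphs.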

Kev
A: 

After thinking about this some more, I came up with this:

  1. Create a table called groups with columns (group_id, set_id)
  2. Sort the sets table by element. Now it should be easy to find duplicate elements.
  3. Iterate through the sets table, and when you find a duplicate element do:
    1. If one of the two set_ids already exists in the groups table, add the other one with the same group_id.
    2. If neither set_id exists in the groups table, generate a new group_id and add both set_ids to the groups table.

In the end I should have a groups table containing all the sets.
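
A rough sketch of steps 1-3 in plain Python, assuming the sets table has been read into a list of (element, set_id) rows (the data and names below are just for illustration):

rows = [(1, 1), (2, 1), (3, 1), (5, 2), (2, 2), (6, 2)]

groups = {}          # the "groups" table: set_id -> group_id
next_group_id = 1

rows.sort()                                  # step 2: sort by element
for (elem_a, set_a), (elem_b, set_b) in zip(rows, rows[1:]):
    if elem_a != elem_b:
        continue                             # not a duplicate element
    if set_a in groups and set_b not in groups:
        groups[set_b] = groups[set_a]        # step 3.1
    elif set_b in groups and set_a not in groups:
        groups[set_a] = groups[set_b]        # step 3.1
    elif set_a not in groups and set_b not in groups:
        groups[set_a] = groups[set_b] = next_group_id   # step 3.2
        next_group_id += 1
    # else: both set_ids already have a group_id; the steps above don't say
    # what to do if those group_ids differ.

print(groups)   # e.g. {1: 1, 2: 1}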

This is not pure SQL, but it seems like O(n log n), so I guess I can live with that.

Matt's answer seems more correct, but I'm not sure how to implement it in my case.

itsadok
+6  A: 

The problem is exactly the computation of the connected components of a hypergraph: the integers are the vertices, and the sets are the hyperedges. A usual way of computing the connected components is to flood them one after the other:

  • for all i = 1 to N, do:
      • if i has been tagged by some j < i, then continue (I mean skip to the next i)
      • else flood_from(i, i)

where flood_from(i,j) would be defined as

  • for each set S containing i, if it is not already tagged by j, then:
      • tag S by j, and for each element k of S, if k is not already tagged by j, then tag it by j and call flood_from(k, j)

The tags of the sets then give you the connected components you are looking for.
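
In Python, the flooding might look something like this minimal sketch (the data and variable names are illustrative, not from the original table):

sets = [{1, 2, 3}, {5, 2, 6}, {7, 8, 9}, {6, 12, 13}, {21, 8, 34}, {19, 20}]

set_tag = {}       # index of a set  -> tag j
element_tag = {}   # integer element -> tag j

def flood_from(i, j):
    """Tag with j every set and element reachable from element i."""
    for n, s in enumerate(sets):
        if i in s and set_tag.get(n) != j:
            set_tag[n] = j
            for k in s:
                if element_tag.get(k) != j:
                    element_tag[k] = j
                    flood_from(k, j)

for i in sorted(set().union(*sets)):   # for all i = 1 to N
    if i not in element_tag:           # skip i if it was tagged by some j < i
        element_tag[i] = i
        flood_from(i, i)

print(set_tag)   # e.g. {0: 1, 1: 1, 3: 1, 2: 7, 4: 7, 5: 19}

Sets 0, 1 and 3 end up with the same tag, i.e. in the same connected component, which is exactly the sticky behaviour asked for in the question.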


In terms of databases, the algorithm can be expressed as follows: you add a TAG column to your database, and you compute the connected component of set i by doing

  • S = select all rows where set_id == i
  • set TAG to i for the rows in S
  • S' = select all rows of every set that has a row where TAG is not set and element is in element(S)
  • while S' is not empty, do
      • set TAG to i for the rows in S'
      • S'' = select all rows of every set that has a row where TAG is not set and element is in element(S')
      • S = S union S'
      • S' = S''
  • return set_id(S)
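
As a concrete illustration, here is a sketch of that loop using Python's sqlite3 module; the table name, the sample rows and the nullable TAG column are assumptions, and for simplicity each pass compares against all rows tagged so far rather than only the newest ones:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sets (element INTEGER, set_id INTEGER, tag INTEGER);
    INSERT INTO sets (element, set_id) VALUES
        (1,1),(2,1),(3,1), (5,2),(2,2),(6,2), (7,3),(8,3),(9,3),
        (6,4),(12,4),(13,4), (21,5),(8,5),(34,5), (19,6),(20,6);
""")

def connected_component(conn, i):
    """Flood TAG = i outward from set i and return the tagged set_ids."""
    conn.execute("UPDATE sets SET tag = ? WHERE set_id = ?", (i, i))
    while True:
        # Sets that still have an untagged row sharing an element with a tagged row.
        new_ids = [r[0] for r in conn.execute(
            """SELECT DISTINCT set_id FROM sets
               WHERE tag IS NULL
                 AND element IN (SELECT element FROM sets WHERE tag = ?)""",
            (i,))]
        if not new_ids:
            break
        # Tag every row of those sets so their other elements keep the flood going.
        conn.executemany("UPDATE sets SET tag = ? WHERE set_id = ?",
                         [(i, s) for s in new_ids])
    return sorted(r[0] for r in conn.execute(
        "SELECT DISTINCT set_id FROM sets WHERE tag = ?", (i,)))

print(connected_component(conn, 1))   # [1, 2, 4]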


Another (theoretical) way of presenting this algorithm would be to say that you are looking for the fixed points of a mapping:

  • if A = {A1, ..., An} is a set of sets, define union(A) = A1 union ... union An
  • if K = {k1, ..., kp} is a set of integers, define incidences(K) = the set of sets which intersect K

Then if S is a set, the connected component of S is obtained by iterating incidences ∘ union on {S} until a fixed point is reached:

  1. K = {S}
  2. K' = incidences(union(K))
  3. if K == K', then return K; else set K = K' and go to 2.
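
A small sketch of this fixed-point iteration in Python, with the question's example sets as frozensets (the function names are just illustrative):

sets = [frozenset(s) for s in
        [(1, 2, 3), (5, 2, 6), (7, 8, 9), (6, 12, 13), (21, 8, 34), (19, 20)]]

def union_all(A):
    """union(A): all the integers covered by the sets in A."""
    return set().union(*A) if A else set()

def incidences(K):
    """incidences(K): the sets that intersect the set of integers K."""
    return {s for s in sets if s & K}

def component(S):
    """Iterate incidences(union(.)) starting from K = {S} until a fixed point."""
    K = {S}
    while True:
        K2 = incidences(union_all(K))
        if K2 == K:
            return K
        K = K2

print(sorted(tuple(sorted(s)) for s in component(sets[0])))
# [(1, 2, 3), (2, 5, 6), (6, 12, 13)]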
Camille
Kudos for the effort! Can you take a look at my answer and tell me if it's wrong, basically the same as your solution, or just a different solution?
itsadok
It seems to me that you would need some merging step for your solution to be complete: you may start different group_ids for sets that should be in the same group, because you haven't discovered that yet. If you find a duplicate and both sets are already in different groups, merge the two groups.
Camille