Given a list of sets:

  • S_1 : [ 1, 2, 3, 4 ]
  • S_2 : [ 3, 4, 5, 6, 7 ]
  • S_3 : [ 8, 9, 10, 11 ]
  • S_4 : [ 1, 8, 12, 13 ]
  • S_5 : [ 6, 7, 14, 15, 16, 17 ]

What is the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected-components problem. So the result would be:

  • [ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17 ] (S_1 UNION S_2 UNION S_5)
  • [ 8, 9, 10, 11 ]
  • [ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)

The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.
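For reference, a minimal sketch of the naive approach in Python (all names are illustrative): treat each input set as a node, add an edge whenever two sets share at least two elements, and collect the connected components with a small union-find. The pairwise loop is the O(N^2) part that needs to be avoided at scale.

```python
from itertools import combinations

def merge_sets(sets):
    """Union all sets that share >= 2 elements (edges are computed
    on the original sets only, as in the example above)."""
    parent = list(range(len(sets)))

    def find(i):
        # Find the root of i's component, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # O(N^2) pairwise comparison - this is the bottleneck.
    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= 2:
            union(i, j)

    # Union the members of each connected component.
    groups = {}
    for i, s in enumerate(sets):
        groups.setdefault(find(i), set()).update(s)
    return list(groups.values())

sets = [{1, 2, 3, 4}, {3, 4, 5, 6, 7}, {8, 9, 10, 11},
        {1, 8, 12, 13}, {6, 7, 14, 15, 16, 17}]
print(merge_sets(sets))
```

On the example above this yields the three groups listed in the question.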

+2  A: 

If you can order the elements in the set, you can look into using Mergesort on the sets. The only modification needed is to check for duplicates during the merge phase. If one is found, just discard the duplicate. Since mergesort is O(n*log(n)), this will offer improved speed when compared to the naive O(n^2) algorithm.

However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.
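A sketch of the merge phase with duplicate elimination, assuming both inputs are already sorted lists (the function name is illustrative):

```python
def merge_unique(a, b):
    """Merge two sorted lists into one sorted list without duplicates."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x = a[i]; i += 1
        elif i == len(a) or b[j] < a[i]:
            x = b[j]; j += 1
        else:
            # a[i] == b[j]: keep one copy, advance both sides
            x = a[i]; i += 1; j += 1
        if not out or out[-1] != x:
            out.append(x)
    return out
```

For example, `merge_unique([1, 2, 3, 4], [3, 4, 5, 6, 7])` merges the first two sets of the question into `[1, 2, 3, 4, 5, 6, 7]`.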

Kyle Cronin
I don't see how this addresses the issue of finding which sets have 2 or more elements in common. This just shows how to find the union of two sets, which I think is the easier part of this problem.
Claudiu
I don't think knowing if a set has 2 or more elements in common helps at all. Since you don't know how many duplicates there are, there's no way you can stop checking for them.
Kyle Cronin
A: 

One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.

Svante
A: 

If your elements are numerical in nature, or can be naturally ordered (i.e. you can assign a value such as 1, 2, 42, etc.), I would suggest using a radix sort on the merged sets, and making a second pass to pick up on the unique elements.

This algorithm should be of O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.
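A possible sketch of this idea for non-negative integers: an LSD radix sort over 8-bit digits using shifts and masks, followed by one linear pass that drops adjacent duplicates. Function names are illustrative.

```python
def radix_sort(nums, key_bits=32):
    """LSD radix sort on 8-bit digits, using shifts and a bit mask."""
    mask = 0xFF
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for n in nums:
            buckets[(n >> shift) & mask].append(n)
        # Stable concatenation preserves order from earlier digits.
        nums = [n for b in buckets for n in b]
    return nums

def merge_dedup(sets):
    """Concatenate all sets, sort in linear time, drop duplicates."""
    merged = radix_sort([x for s in sets for x in s])
    return [x for i, x in enumerate(merged) if i == 0 or x != merged[i - 1]]
```

Note that, as written, this produces the deduplicated union of all the input sets; it does not by itself decide *which* sets share two or more elements.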

+3  A: 

Let there be a list of many Sets named (S)

Perform a pass through all elements of S, to determine the range (LOW .. HIGH).

Create a two-dimensional array of pointers to Set, with both indices ranging over (LOW .. HIGH), named (M).

do
    Init all elements of M to NULL.   

    Iterate though S, processing them one Set at a time, named (Si).

        Permutate all ordered pairs (P1, P2) in Si, where P1 < P2.
        For each pair examine M(P1, P2)
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si, into the Set pointed to by, M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1)

        If Si was not merged, 
            Permutate again through Si
            For each pair, make M(P1, P2) point to Si.

while At least one set was merged during the pass.

My head is saying this is about Order (2N ln N). Take that with a grain of salt.
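A rough Python sketch of the pseudocode above, assuming hashable elements. A dict keyed by pairs stands in for the (LOW, HIGH) pointer array, and, as in the pseudocode, a merged set does not re-register its pairs; its new pairs are picked up on the next pass.

```python
from itertools import combinations

def merge_pass(sets):
    """One pass: a set merges into an earlier result set as soon as
    one of its element pairs is already registered."""
    pair_owner = {}   # (p1, p2) with p1 < p2 -> index into `result`
    result = []
    merged_any = False
    for s in sets:
        pairs = combinations(sorted(s), 2)
        hit = next((p for p in pairs if p in pair_owner), None)
        if hit is not None:
            # Merge; the merged set's new pairs wait for the next pass.
            result[pair_owner[hit]] |= s
            merged_any = True
        else:
            idx = len(result)
            result.append(set(s))
            for p in combinations(sorted(s), 2):
                pair_owner[p] = idx
    return result, merged_any

def merge_all(sets):
    """Repeat passes until a pass performs no merges."""
    merged_any = True
    while merged_any:
        sets, merged_any = merge_pass(sets)
    return sets
```

On the case discussed in the comments, {1,2,3}, {2,3,4}, {1,4}, the first pass merges the first two, and the second pass merges {1,4} into the result, matching the multi-pass behavior described below.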

EvilTeach
This assumes a set has all the elements between low and high in it, which isn't true - or am I mistaken?
Claudiu
After you merge Si, you still have to permute through all pairs in Si and add them to M (pointed to M(P1, P2)) before moving on to Set S(i + 1), right? Otherwise, this looks good.
bajafresh4life
What is supposed to happen with sets {1, 2, 3}, {2, 3, 4} and {1, 4}? First and second get merged, and the merged set has two duplicates with the third - should the third get merged, or is it only the original content of the sets that counts? I think this answer does the former, and not the latter
Paul
Paul: yeah, this is the problem that I was trying to get at with my previous comment. When {2, 3, 4} gets merged with {1, 2, 3}, the permuted pairs from the new merged set need to be added to M.
bajafresh4life
@Claudiu: No, the set is not required to be contiguous.
EvilTeach
@baj: No, that gets complex. What you describe happens in the next pass of the algorithm. It is not a goal to do all merges in one pass. @Paul: first pass, (1,2,3) and (2,3,4) merge; (1,4) does not. Second pass, (1,4) merges with the merged set (1,2,3,4).
EvilTeach
The other thing that hit me this morning is that the permutations should be precalculated, and recalculated after a merge for efficiency.
EvilTeach
The nasty case for this one is {1, 2, 3}, {2, 3, 4}, {1, 4, 5} {2, 5, 6} {3, 6, 7}, {4, 7, 8}.., I think. The first run merges the first two only. The second merges the newly-merged first with the second, and the third the same thing. So you get N-1 runs of a loop that's N, N-1..1 long. So O(N^2).
Paul
Ya, as I mentioned below.
EvilTeach
Got it. In my use of this, the "nasty case" should be extremely rare. So I think this is the best solution. Thanks!
bajafresh4life
@EvilTeach, despite my comments, I do think this algorithm's pretty cool :)
Paul
@Paul I appreciate your comments. An exercise like this gets the old mental juices flowing.
EvilTeach
An analysis of all of the cases, to estimate what the actual order number is for the average case, might be fun.
EvilTeach
@baj: what did you end up doing?
EvilTeach
+1  A: 

I don't see how this can be done in less than O(n^2).

Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.

In sorting, the naive implementation is O(n^2), but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it's already been compared to the pivot). This is what results in sorting being O(n * log n).

This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.

Paul.

Paul
If the elements in the set can be ordered, duplicate items will always be adjacent to one another. This allows us to confine our search for them to the adjacent items, an O(1) operation, instead of searching for them every time, an O(n) operation.
Kyle Cronin
"the elements in the set can be ordered..." Even if duplicate detection is O(1), there are still O(N^2) comparisons to do. And anyway, we're not looking for duplicate items within a set. We're looking for items duplicated between two sets, and they could be the first or last or any other.
Paul
Ordering the elements in the set does not in any way imply that the duplicate items will be adjacent. A pair is duplicate iff there is an identical pair in another set.
EvilTeach
I tend to think you are correct, Paul. The multipass thing I did above could have O(N*N) behavior. It really is a function of what the distribution of duplicates around the sets is. It may be that Set 1 and 2 merge, then next pass set 3 merges... all the way up to N. Kinda like quicksort.
EvilTeach