Given a list of sets:

  • S_1 : [ 1, 2, 3, 4 ]
  • S_2 : [ 3, 4, 5, 6, 7 ]
  • S_3 : [ 8, 9, 10, 11 ]
  • S_4 : [ 1, 8, 12, 13 ]
  • S_5 : [ 6, 7, 14, 15, 16, 17 ]

What is the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected-components problem. So the result would be:

  • [ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17 ] (S_1 UNION S_2 UNION S_5)
  • [ 8, 9, 10, 11 ]
  • [ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)

The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.
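For reference, a minimal sketch of the naive approach in Python (all names are illustrative): treat each input set as a node, add an edge whenever two sets share at least two elements, and collect the connected components with a small union-find. The pairwise loop is the O(N^2) part that needs to be avoided at scale.

```python
from itertools import combinations

def merge_sets(sets):
    """Union all sets that share >= 2 elements (edges are computed
    on the original sets only, as in the example above)."""
    parent = list(range(len(sets)))

    def find(i):
        # Find the root of i's component, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # O(N^2) pairwise comparison - this is the bottleneck.
    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= 2:
            union(i, j)

    # Union the members of each connected component.
    groups = {}
    for i, s in enumerate(sets):
        groups.setdefault(find(i), set()).update(s)
    return list(groups.values())

sets = [{1, 2, 3, 4}, {3, 4, 5, 6, 7}, {8, 9, 10, 11},
        {1, 8, 12, 13}, {6, 7, 14, 15, 16, 17}]
print(merge_sets(sets))
```

On the example above this yields the three groups listed in the question.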

+2  A: 

If you can order the elements in the set, you can look into using Mergesort on the sets. The only modification needed is to check for duplicates during the merge phase. If one is found, just discard the duplicate. Since mergesort is O(n*log(n)), this will offer improved speed when compared to the naive O(n^2) algorithm.

However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.
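A sketch of the merge phase with duplicate elimination, assuming both inputs are already sorted lists (the function name is illustrative):

```python
def merge_unique(a, b):
    """Merge two sorted lists into one sorted list without duplicates."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x = a[i]; i += 1
        elif i == len(a) or b[j] < a[i]:
            x = b[j]; j += 1
        else:
            # a[i] == b[j]: keep one copy, advance both sides
            x = a[i]; i += 1; j += 1
        if not out or out[-1] != x:
            out.append(x)
    return out
```

For example, `merge_unique([1, 2, 3, 4], [3, 4, 5, 6, 7])` merges the first two sets of the question into `[1, 2, 3, 4, 5, 6, 7]`.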

Kyle Cronin
I don't see how this addresses the issue of finding which sets have 2 or more elements in common. This just shows how to find the union of two sets, which I think is the easier part of this problem.
Claudiu
I don't think knowing if a set has 2 or more elements in common helps at all. Since you don't know how many duplicates there are, there's no way you can stop checking for them.
Kyle Cronin
A: 

One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.

Svante
A: 

If your elements are numerical in nature, or can be naturally ordered (i.e. you can assign a value such as 1, 2, 42, etc.), I would suggest using a radix sort on the merged sets, and making a second pass to pick up on the unique elements.

This algorithm should be of O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.
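A possible sketch of this idea for non-negative integers: an LSD radix sort over 8-bit digits using shifts and masks, followed by one linear pass that drops adjacent duplicates. Function names are illustrative.

```python
def radix_sort(nums, key_bits=32):
    """LSD radix sort on 8-bit digits, using shifts and a bit mask."""
    mask = 0xFF
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for n in nums:
            buckets[(n >> shift) & mask].append(n)
        # Stable concatenation preserves order from earlier digits.
        nums = [n for b in buckets for n in b]
    return nums

def merge_dedup(sets):
    """Concatenate all sets, sort in linear time, drop duplicates."""
    merged = radix_sort([x for s in sets for x in s])
    return [x for i, x in enumerate(merged) if i == 0 or x != merged[i - 1]]
```

Note that, as written, this produces the deduplicated union of all the input sets; it does not by itself decide *which* sets share two or more elements.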

+3  A: 

Let there be a list of many Sets named (S)

Perform a pass through all elements of S, to determine the range (LOW .. HIGH).

Create a two-dimensional array of pointers to Set, with both indices ranging over (LOW .. HIGH), named (M).

do
    Init all elements of M to NULL.   

    Iterate though S, processing them one Set at a time, named (Si).

        Permutate all ordered pairs (P1, P2) in Si, where P1 < P2.
        For each pair examine M(P1, P2)
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si, into the Set pointed to by, M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1)

        If Si was not merged, 
            Permutate again through Si
            For each pair, make M(P1, P2) point to Si.

while At least one set was merged during the pass.

My head is saying this is about Order (2N ln N). Take that with a grain of salt.
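A rough Python sketch of the pseudocode above, assuming hashable elements. A dict keyed by pairs stands in for the (LOW, HIGH) pointer array, and, as in the pseudocode, a merged set does not re-register its pairs; its new pairs are picked up on the next pass.

```python
from itertools import combinations

def merge_pass(sets):
    """One pass: a set merges into an earlier result set as soon as
    one of its element pairs is already registered."""
    pair_owner = {}   # (p1, p2) with p1 < p2 -> index into `result`
    result = []
    merged_any = False
    for s in sets:
        pairs = combinations(sorted(s), 2)
        hit = next((p for p in pairs if p in pair_owner), None)
        if hit is not None:
            # Merge; the merged set's new pairs wait for the next pass.
            result[pair_owner[hit]] |= s
            merged_any = True
        else:
            idx = len(result)
            result.append(set(s))
            for p in combinations(sorted(s), 2):
                pair_owner[p] = idx
    return result, merged_any

def merge_all(sets):
    """Repeat passes until a pass performs no merges."""
    merged_any = True
    while merged_any:
        sets, merged_any = merge_pass(sets)
    return sets
```

On the case discussed in the comments, {1,2,3}, {2,3,4}, {1,4}, the first pass merges the first two, and the second pass merges {1,4} into the result, matching the multi-pass behavior described below.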

EvilTeach
This assumes a set has all the elements between low and high in it, which isn't true - or am I mistaken?
Claudiu
After you merge Si, you still have to permute through all pairs in Si and add them to M (pointed to M(P1, P2)) before moving on to Set S(i + 1), right? Otherwise, this looks good.
bajafresh4life
What is supposed to happen with sets {1, 2, 3}, {2, 3, 4} and {1, 4}? First and second get merged, and the merged set has two duplicates with the third - should the third get merged, or is it only the original content of the sets that counts? I think this answer does the former, and not the latter
Paul
Paul: yeah, this is the problem that I was trying to get at with my previous comment. When {2, 3, 4} gets merged with {1, 2, 3}, the permuted pairs from the new merged set need to be added to M.
bajafresh4life
@Claudiu: No, the set is not required to be contiguous.
EvilTeach
@baj: No, that gets complex. What you describe happens in the next pass of the algorithm. It is not a goal to do all merges in one pass. @Paul: first pass, (1,2,3) and (2,3,4) merge; (1,4) does not. Second pass, (1,4) merges with the merged set (1,2,3,4).
EvilTeach
The other thing that hit me this morning is that the permutations should be precalculated, and recalculated after a merge for efficiency.
EvilTeach
The nasty case for this one is {1, 2, 3}, {2, 3, 4}, {1, 4, 5} {2, 5, 6} {3, 6, 7}, {4, 7, 8}.., I think. The first run merges the first two only. The second merges the newly-merged first with the second, and the third the same thing. So you get N-1 runs of a loop that's N, N-1..1 long. So O(N^2).
Paul
Ya, as I mentioned below.
EvilTeach
Got it. In my use of this, the "nasty case" should be extremely rare. So I think this is the best solution. Thanks!
bajafresh4life
@EvilTeach, despite my comments, I do think this algorithm's pretty cool :)
Paul
@Paul I appreciate your comments. An exercise like this gets the old mental juices flowing.
EvilTeach
An analysis of all of the cases, to estimate what the actual order number is for the average case, might be fun.
EvilTeach
@baj: what did you end up doing?
EvilTeach
+1  A: 

I don't see how this can be done in less than O(n^2).

Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.

In sorting, the naive implementation is O(n^2), but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it's already been compared to the pivot). This is what results in sorting being O(n * log n).

This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.

Paul.

Paul
If the elements in the set can be ordered, duplicate items will always be adjacent to one another. This allows us to confine our search for them to the adjacent items, an O(1) operation, instead of searching for them every time, an O(n) operation.
Kyle Cronin
"the elements in the set can be ordered..." Even if duplicate detection is O(1), there are still O(N^2) comparisons to do. And anyway, we're not looking for duplicate items within a set. We're looking for items duplicated between two sets, and they could be the first or last or any other.
Paul
Ordering the elements in the set does not in any way imply that the duplicate items will be adjacent. A pair is duplicate iff there is an identical pair in another set.
EvilTeach
I tend to think you are correct, Paul. The multipass thing I did above could have O(N*N) behavior. It really is a function of what the distribution of duplicates around the sets is. It may be that Set 1 and 2 merge, then next pass set 3 merges... all the way up to N. Kinda like quicksort.
EvilTeach