I am looking for an abstract data structure which represents a collection of sets such that no set in the collection is a subset of another set in the collection.

This means that on insert the following conditions will be met:

A. Inserting a set that is a subset of a set already in the collection returns the collection unchanged.

B. Inserting a set that is a superset of one or more sets in the collection results in a collection with the superset added and those subsets removed.

Assuming an ordering on the elements of the sets, a prefix tree can be used to represent the collection. This lets condition A be handled very quickly (i.e. checking the condition takes no longer than inserting the subset would), but meeting condition B takes time.
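One way to picture the prefix-tree idea is the following minimal Python sketch (all names are my own; sets are stored as sorted paths, and the condition-A check walks the trie, skipping elements a stored set has but the query does not):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # element -> TrieNode
        self.terminal = False   # True if a stored set ends at this node

def add_to_trie(root, elems):
    """Store a set as a sorted path through the trie."""
    node = root
    for e in sorted(elems):
        node = node.children.setdefault(e, TrieNode())
    node.terminal = True

def covered(node, elems):
    """Condition A check: is some stored set a superset of 'elems'?
    'elems' must be a sorted list."""
    if not elems:
        # Every remaining stored element is an extra; any terminal below works.
        return node.terminal or any(covered(c, []) for c in node.children.values())
    head, rest = elems[0], elems[1:]
    for e, child in node.children.items():
        if e == head and covered(child, rest):
            return True
        if e < head and covered(child, elems):  # skip an extra stored element
            return True
    return False
```

For example, after `add_to_trie(root, ["a", "b", "d"])`, the query `covered(root, ["a", "d"])` succeeds by skipping the stored `"b"`, while `covered(root, ["c"])` fails.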

I am wondering if there is a data structure that allows B to be met quickly as well.

A: 

A trivial idea, which will work in O(K), where K is the size of the set being added:

  • keep the sets in whatever way you want
  • keep a map of set_id -> set_size
  • keep a map of object -> set_id

Both A and B are O(K).

Drakosha
Note that an object can be a member of multiple sets, as in {{a, b}, {b, c}, {a, c}}.
Jason Orendorff
A: 

The trivial approach would be to keep a list of sets and perform a linear search through it for every incoming set, testing whether the incoming set is a subset of each.

This obviously runs in O(n) time for the linear search, with O(m) per subset test for the size of the incoming set. Thus O(n*m) total time (number of sets times size of each set).

The most obvious optimization, of course, is to index on set sizes. Then you only test each incoming set against those which are of equal or larger size. (A set cannot be a subset of any smaller set, duh!).

The next optimization that comes to mind is to create an index of elements. For each incoming set you'd find the intersection of the sets containing each of its elements. In other words, if, for incoming set {a,b,c}, we find that element a exists in sets A, B, and D, element b exists in B, E, and F, and c exists in A, B, and Z, then the incoming set is a subset of B (the intersection of {A, B, D}, {B, E, F}, and {A, B, Z}).

So, that sounds like O(m*log(n)) complexity to me (we have to perform a hashed search for each element of the incoming set). Insertions should also be on the same order (inserting the new set's ID into each element's map). (In Big-O analysis, 2*O(m*log(n)) reduces to O(m*log(n)), of course.)
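The element-index approach might be sketched roughly as follows (Python; the class and method names are my own invention, and only the condition-A subset test is shown, since the answer does not cover removing covered subsets for condition B):

```python
class SetIndex:
    """Inverted index: element -> ids of stored sets containing it."""

    def __init__(self):
        self.sets = {}       # set_id -> frozenset
        self.index = {}      # element -> set of set_ids
        self.next_id = 0

    def covering_set_exists(self, s):
        """True if some stored set is a superset of s (intersection of postings)."""
        ids = None
        for elem in s:
            posting = self.index.get(elem)
            if not posting:
                return False          # no stored set contains this element
            ids = posting.copy() if ids is None else ids & posting
            if not ids:
                return False          # intersection ran empty
        return bool(ids)

    def add(self, s):
        s = frozenset(s)
        if self.covering_set_exists(s):
            return                    # condition A: already covered, no-op
        sid = self.next_id
        self.next_id += 1
        self.sets[sid] = s
        for elem in s:
            self.index.setdefault(elem, set()).add(sid)
```

For example, after adding {a, b}, adding {a} is a no-op because the postings for a intersect to a non-empty set of covering sets.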

Jim Dennis
A: 

If the individual members of your sets A, B, ... are mapped to distinct prime numbers, and alongside each set you store the product of all its members as p(A), p(B), etc., then subsets and supersets can be found by checking whether p(X) is a factor of p(Y) or vice versa.

You could end up with some very large numbers I guess, but it works in theory.

For example:

if [a b c d] -> [2 3 5 7], p(abc) = 30, p(abd) = 42, p(bc) = 15, p(abcd) = 210
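As a rough Python sketch of this encoding (the element-to-prime mapping is assumed, and Python's arbitrary-precision integers stand in for a BigNum library):

```python
from math import prod  # Python 3.8+

# Assumed mapping of elements to distinct primes, matching the example above.
PRIMES = {"a": 2, "b": 3, "c": 5, "d": 7}

def p(elements):
    """Product encoding of a set of elements."""
    return prod(PRIMES[e] for e in elements)

def is_subset(x, y):
    """x is a subset of y iff p(x) divides p(y) exactly."""
    return p(y) % p(x) == 0
```

With this mapping, p("abc") is 30 and p("abcd") is 210, and `is_subset("bc", "abcd")` holds because 15 divides 210.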

Hairy Jock
Isn't the problem of factoring numbers NP complete?
Stephen C
Not if you use a large number library with a divide function!
Hairy Jock
I should have added that in this case the problem is just division, not factoring.
Hairy Jock
@Stephen C: finding the factors of an arbitrary number is NP complete. The issue is does the smaller of the two numbers exactly divide the larger, a simple modulus operation. For example, p(bc)=15 is a divisor of p(abcd)=210, so bc is a subset of abcd. Adding a new set S to the existing set of N sets requires up to N modulus operations; fewer if S is a duplicate. For each existing entry E in the set of sets, if S is smaller than E, check if S divides E and ignore if it does. If S is larger than E, check whether E divides S, and remove E if it does (and add S). An array of BigNums would work.
Jonathan Leffler
That's better still - it's hard to see how anything could be more efficient than that. I think you need to check a new set against all existing sets though, only adding it at the end.
Hairy Jock
A: 

What a dorky site! I have now registered, so may in due course be able to comment on stuff from yesterday. Until then, however...

@Stephen C: although I believe my English to be adequate I seem to have acquired an explicator: he has missed bits out, however, and his comment should read as follows:

@Stephen C: finding the factors of an arbitrary number is indeed NP complete, but not relevant here. The issue is whether the smaller of two numbers exactly divides the larger, a simple modulus operation. For example, p(bc)=15 is a divisor of p(abcd)=210, so bc is a subset of abcd (as are sets abd and abc).

Adding a new set S to the existing collection of N sets is O(N), assuming that comparison and division of the large numbers takes roughly the same time regardless of N.

For each existing entry E in the collection of sets, stop if p(S) < p(E) and p(S) divides p(E) exactly. If p(S) = p(E), stop. Remove E if p(S) > p(E) and p(E) divides p(S) exactly. Add S if you get to the end of the collection. An array of BigNums would work.
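The loop described above might be sketched like this in Python, representing each set only by its product (arbitrary-precision ints stand in for BigNums; the function name is hypothetical):

```python
def insert(products, ps):
    """Insert product ps into the list 'products', maintaining the invariant
    that no stored set is a subset of another, via divisibility checks."""
    survivors = []
    for pe in products:
        if ps <= pe and pe % ps == 0:
            return products      # S is a subset of (or equal to) E: keep as-is
        if ps > pe and ps % pe == 0:
            continue             # E is a subset of S: drop E
        survivors.append(pe)
    survivors.append(ps)         # S survived every check: add it at the end
    return survivors
```

Note that S is only appended after the whole collection has been scanned, in line with the comment above about checking against all existing sets before adding.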

@JL: if you'd like to be my on-site stalker please endeavour to 1) add value 2) accurately.

Hairy Jock