Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.

Suppose I have a giant list G of giant sets, and a list of sets H that is unknown to me. Each individual set I in G was created by taking the union of some unknown number of sets from H, then adding and removing an unknown number of elements.

Now, I could use other data to construct a few of the sets in H, but I suspect there is some technique involving Bayesian probability that could recover more of them. E.g. something like, "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
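For concreteness, the conditional probabilities in that heuristic could be estimated by simple counting over G; a rough Python sketch (the function name and output shape are just illustrative):

    from collections import defaultdict
    from itertools import combinations

    def cooccurrence(G):
        # Estimate P(Y in S | X in S) over the sets S in G by counting.
        count = defaultdict(int)   # number of sets containing X
        pair = defaultdict(int)    # number of sets containing both X and Y
        for s in G:
            for x in s:
                count[x] += 1
            for x, y in combinations(s, 2):
                pair[(x, y)] += 1
                pair[(y, x)] += 1
        # The returned value for (X, Y) approximates P(Y | X).
        return {(x, y): c / count[x] for (x, y), c in pair.items()}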

Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.

Any thoughts?

Example usage:

Compress G by replacing chunks of each set with references to sets in H, e.g.

G[1]  = {1,2,3,5,6,7,9,10,11}
H[5]  = {1,2,3}
H[6]  = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
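To make the compression concrete, here is a small sketch of the decompression side, reusing the example above (the encoding, with frozensets as H-references and negative integers as removals, is just one illustrative choice):

    def expand(tokens):
        # Rebuild a set of G from H-references plus add/remove patches.
        result = set()
        for t in tokens:
            if isinstance(t, frozenset):
                result |= t          # a reference to a set from H
            elif t < 0:
                result.discard(-t)   # "-8" means: remove element 8
            else:
                result.add(t)        # a literal extra element, like 11
        return result

    H5 = frozenset({1, 2, 3})
    H6 = frozenset({5, 6, 7, 8, 9, 10})
    assert expand([H5, H6, -8, 11]) == {1, 2, 3, 5, 6, 7, 9, 10, 11}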
A: 

Well, the current ad-hoc way, which seems to be good enough, is as follows:

  • Remove from every set G_x all elements that appear in fewer than 25 sets.
  • Create a mapping from each element to the sets that contain it, and from each set to its elements.
  • For each element E in the element map, pick 3 sets that contain E and take their intersection. Make two copies of it, A and B.
  • For each set S in the set map that does not contain E, remove all elements of S from A or B (alternating between them).
  • Add Union(A,B) to H.
  • Remove all elements of Union(A,B) from the element-to-set map (i.e. do not find overlapping sets).
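A rough Python sketch of those bullets (min_support and sample correspond to the "25 sets" and "3 sets" constants; all names are illustrative):

    from collections import defaultdict

    def build_h(G, min_support=25, sample=3):
        # Drop elements that appear in fewer than min_support sets.
        support = defaultdict(int)
        for s in G:
            for e in s:
                support[e] += 1
        G = [{e for e in s if support[e] >= min_support} for s in G]

        # Map each element to the indices of the sets that contain it.
        elem_to_sets = defaultdict(set)
        for i, s in enumerate(G):
            for e in s:
                elem_to_sets[e].add(i)

        H, claimed = [], set()
        for e, owners in elem_to_sets.items():
            if e in claimed:
                continue  # element already assigned: skip overlapping sets
            # Intersect a few of the sets that contain e; copy into A and B.
            picked = list(owners)[:sample]
            a = set.intersection(*(G[i] for i in picked))
            b = set(a)
            # Subtract every set that does not contain e, alternating targets.
            flip = True
            for s in G:
                if e not in s:
                    (a if flip else b).difference_update(s)
                    flip = not flip
            candidate = a | b
            if candidate:
                H.append(candidate)
                claimed |= candidate
        return H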
Brian
+3  A: 

Define the distance d(i,j) = 1/(number of sets in G which contain both i and j) and then run a cluster analysis (http://en.wikipedia.org/wiki/Cluster_analysis). The resulting clusters are your candidates for the elements in H.
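For instance, a minimal single-linkage variant with a hard cutoff, which merges i and j whenever d(i,j) is small enough (the max_distance threshold is an assumed tuning knob, and the elements are assumed to sort consistently, e.g. MD5 hashes as strings):

    from collections import defaultdict
    from itertools import combinations

    def cluster_elements(G, max_distance=0.1):
        # Count, for each pair of elements, how many sets contain both.
        pair_counts = defaultdict(int)
        for s in G:
            for i, j in combinations(sorted(s), 2):
                pair_counts[(i, j)] += 1

        # Union-find: link i and j when d(i,j) = 1/count <= max_distance.
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for (i, j), count in pair_counts.items():
            if 1.0 / count <= max_distance:
                parent[find(i)] = find(j)

        # Each resulting cluster is a candidate set for H.
        clusters = defaultdict(set)
        for s in G:
            for e in s:
                clusters[find(e)].add(e)
        return list(clusters.values())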

Jens Schauder
+1 for pointing out the correct place to look for info on this kind of problem.
Brian
I don't think that's quite how you want to define distance, but the link alone is worth an upvote.
Jason Orendorff
The reason I think d(i, j)=1/(number of sets in G which contain both i and j) isn't the right distance function is that it's very small when i and j are both very common but not at all correlated, and much larger when i and j are both rare but highly correlated.
Jason Orendorff
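One distance that avoids this pitfall, for example, is the Jaccard distance between the occurrence sets of i and j: it stays near 1 for common-but-uncorrelated pairs and near 0 for rare-but-correlated ones. A minimal sketch (elem_to_sets maps an element to the indices of the G-sets containing it, as sketched above):

    def jaccard_distance(i, j, elem_to_sets):
        a, b = elem_to_sets[i], elem_to_sets[j]
        union = len(a | b)
        # 0 when i and j always co-occur, 1 when they never do.
        return (1.0 - len(a & b) / union) if union else 1.0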
A: 

How about a deterministic way (if you do not wish the sets to intersect at all):

A) Turn the sets in H into vertices labeled 1, 2, 3, ..., size(H). Create a complete undirected graph between them all. Each vertex gets a value equal to the cardinality of its set.

B) Go through all elements x in the sets in H and create a mapping x -> [x1, x2, ..., xm] if and only if x is in H[xi]. An array of sets will do. This helps you find overlapping sets.

C) Go through all the sets in this array; for every pair x1, x2 that appear within the same set, remove the edge between x1 and x2.

D) In the remaining graph, only non-overlapping sets (well, their indices in H) stay connected.

E) Now find the clique in this graph with the highest total value. From it you can reconstruct the list of non-intersecting sets with the highest coverage. It is trivial to compute the missing elements.

F) If you want to minimize the cardinality of the remaining set, subtract 0.5 from the value of each vertex. Then {a,b} scores 2 - 0.5 = 1.5, while {a} and {b} together score only 0.5 + 0.5 = 1, so the algorithm will prefer the single set {a,b} over the two sets {a} and {b}.

This may not be exactly what you want, but it might inspire you.
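A brute-force sketch of step E (which, as Brian points out below, presumes H is already known); it checks every subset for pairwise disjointness, so it is only feasible for a small H:

    from itertools import combinations

    def best_disjoint_cover(H, penalty=0.5):
        # Maximize the penalized total cardinality over all collections
        # of pairwise-disjoint sets (the edges that survived step C).
        best, best_value = [], float("-inf")
        for r in range(1, len(H) + 1):
            for combo in combinations(range(len(H)), r):
                sets = [H[i] for i in combo]
                if any(a & b for a, b in combinations(sets, 2)):
                    continue  # overlapping pair: edge removed in step C
                value = sum(len(s) - penalty for s in sets)
                if value > best_value:
                    best, best_value = list(combo), value
        return best  # indices into H of the chosen sets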

Hamish Grubijan
I don't have H.
Brian
Hm ... please enlighten me on G with examples. How do you supposedly construct G[2]? Thanks.
Hamish Grubijan