Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.
Suppose I have a giant list of giant sets G and a list of sets H that is unknown to me. Each individual set I in G was created by taking the union of some unknown number of sets from H, then adding and removing an unknown number of elements.
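For concreteness, here is a minimal sketch of that generative process as I picture it (Python; make_G_from_H, noise, and the integer elements standing in for file hashes are all hypothetical placeholders):

import random

def make_G_from_H(H, num_sets, noise=2, seed=0):
    """Synthetic G: each set is a union of a random choice of sets from H,
    with a few elements removed and a few spurious ones added."""
    rng = random.Random(seed)
    # pool of "spurious" elements that never appear in H
    universe = set().union(*H) | set(range(1000, 1000 + 10 * noise))
    G = []
    for _ in range(num_sets):
        # union of an unknown (here random) number of sets from H
        parts = rng.sample(H, k=rng.randint(1, len(H)))
        s = set().union(*parts)
        # remove a few elements ...
        s -= set(rng.sample(sorted(s), k=min(noise, len(s))))
        # ... and add a few elements from outside the current set
        s |= set(rng.sample(sorted(universe - s), k=noise))
        G.append(s)
    return G

# hypothetical H for illustration
H = [{1, 2, 3}, {5, 6, 7, 8, 9, 10}, {20, 21, 22}]
G = make_G_from_H(H, num_sets=5)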
Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this, e.g. something like: "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
Example usage: compress G by replacing chunks of it with pieces of H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
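A sketch of that compression step, assuming H (or a reconstruction of it) is available as a list of sets: greedily pick the H sets that cover the most of g, then record the leftover additions and removals explicitly. The names compress and min_overlap are hypothetical.

def compress(g, H, min_overlap=2):
    """Represent a set g from G as indices into H plus explicit adds/removes.
    Greedy cover: repeatedly take the H set covering the most uncovered elements."""
    covered, used = set(), []
    while True:
        best = max(range(len(H)),
                   key=lambda i: len((H[i] & g) - covered),
                   default=None)
        if best is None or len((H[best] & g) - covered) < min_overlap:
            break
        used.append(best)
        covered |= H[best]
    union = set().union(*(H[i] for i in used)) if used else set()
    added = g - union      # elements of g not explained by the chosen H sets
    removed = union - g    # elements the chosen H sets bring in that g lacks
    return used, added, removed

# reproduces the example above (with 0-based list indices instead of 5 and 6)
H = [{1, 2, 3}, {5, 6, 7, 8, 9, 10}]
g = {1, 2, 3, 5, 6, 7, 9, 10, 11}
print(compress(g, H))   # -> ([1, 0], {11}, {8}), i.e. G[1]' = {H[1], H[0], -8, 11}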