Context: Consider each set in G to be the collection of files (contents or MD5 hashes, not names) found on a particular computer.
Suppose I have a giant list G of giant sets and a list of sets H that is unknown to me. Each individual set in G was created by taking the union of some unknown number of sets from H, then adding and removing an unknown number of elements.
Now, I could use other data to construct a few of the sets in H. However, I suspect there is some technique involving Bayesian probability that could do this more systematically. E.g. something like: "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
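To make that heuristic concrete, here's a rough Python sketch of what I have in mind (the name candidate_h_sets and the 0.9 threshold are just placeholders I made up, not anything standard): it estimates P(Y in S | X in S) across the sets S in G, then greedily groups elements whose conditional frequencies are high in both directions into candidate H sets.

    from itertools import combinations
    from collections import defaultdict

    def candidate_h_sets(G, threshold=0.9):
        """Group elements that almost always co-occur across G into candidate H sets."""
        count = defaultdict(int)        # how many sets in G contain x
        pair_count = defaultdict(int)   # how many sets in G contain both x and y (x < y)
        for s in G:
            for x in s:
                count[x] += 1
            for x, y in combinations(sorted(s), 2):
                pair_count[(x, y)] += 1

        def cond(x, y):
            # Estimated P(y in S | x in S) over the sets S in G.
            key = (x, y) if x < y else (y, x)
            return pair_count[key] / count[x]

        groups = []
        for x in sorted(count):
            for g in groups:
                # Join a group only if x co-occurs strongly with every member, both ways.
                if all(cond(x, y) >= threshold and cond(y, x) >= threshold for y in g):
                    g.add(x)
                    break
            else:
                groups.append({x})
        return [g for g in groups if len(g) > 1]

    # Toy data: 1,2,3 and 5,6,7 each come from a hidden set, plus some noise elements.
    G = [{1, 2, 3, 5, 6, 7, 9}, {1, 2, 3, 8}, {5, 6, 7, 10}, {1, 2, 3, 5, 6, 7}]
    print(candidate_h_sets(G))   # prints something like [{1, 2, 3}, {5, 6, 7}]

This only looks at pairwise co-occurrence, so it would miss H sets that overlap heavily with each other, but it captures the basic "X implies Y" idea from above.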
Example usage:
Compress G by replacing chunks of each set with references to sets from H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5], H[6], -8, 11}   (i.e. H[5] ∪ H[6], with 8 removed and 11 added)
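For illustration, here's a rough sketch of that compression step, assuming the H sets are already known and indexed (encode and min_overlap are made-up names): it greedily picks H sets that are mostly contained in the given G set, then records whatever is left over as added/removed elements.

    def encode(g, H, min_overlap=0.8):
        """Greedily cover g with named H sets, then record the leftover delta."""
        chosen = []
        covered = set()
        for name, h in sorted(H.items(), key=lambda kv: -len(kv[1])):
            # Accept h if most of its elements are actually present in g.
            if len(h & g) / len(h) >= min_overlap:
                chosen.append(name)
                covered |= h
        added = g - covered      # elements of g not explained by any chosen H set
        removed = covered - g    # elements the chosen H sets predict but g lacks
        return chosen, added, removed

    G1 = {1, 2, 3, 5, 6, 7, 9, 10, 11}
    H = {5: {1, 2, 3}, 6: {5, 6, 7, 8, 9, 10}}
    print(encode(G1, H))   # -> ([6, 5], {11}, {8}), i.e. G[1]' = {H[5], H[6], -8, 11}

The better the recovered H matches the true one, the smaller these deltas should be, which is roughly how I'd judge whether a candidate H is any good.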