Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.
Suppose I have a giant list of giant sets G and a list of sets H that is unknown to me. Each individual set I in G was created by taking the union of some unknown number of sets from H, then adding and removing an unknown number of elements.
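For concreteness, here is a minimal sketch of that generative process as I picture it (Python; make_G_from_H, noise, and the integer elements standing in for file hashes are all hypothetical placeholders):

import random

def make_G_from_H(H, num_sets, noise=2, seed=0):
    """Synthetic G: each set is a union of a random choice of sets from H,
    with a few elements removed and a few spurious ones added."""
    rng = random.Random(seed)
    # pool of "spurious" elements that never appear in H
    universe = set().union(*H) | set(range(1000, 1000 + 10 * noise))
    G = []
    for _ in range(num_sets):
        # union of an unknown (here random) number of sets from H
        parts = rng.sample(H, k=rng.randint(1, len(H)))
        s = set().union(*parts)
        # remove a few elements ...
        s -= set(rng.sample(sorted(s), k=min(noise, len(s))))
        # ... and add a few elements from outside the current set
        s |= set(rng.sample(sorted(universe - s), k=noise))
        G.append(s)
    return G

# hypothetical H for illustration
H = [{1, 2, 3}, {5, 6, 7, 8, 9, 10}, {20, 21, 22}]
G = make_G_from_H(H, num_sets=5)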
Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this, e.g. something like: "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
Example usage: compress G by replacing chunks of it with pieces of H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
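A sketch of that compression step, assuming H (or a reconstruction of it) is available as a list of sets: greedily pick the H sets that cover the most of g, then record the leftover additions and removals explicitly. The names compress and min_overlap are hypothetical.

def compress(g, H, min_overlap=2):
    """Represent a set g from G as indices into H plus explicit adds/removes.
    Greedy cover: repeatedly take the H set covering the most uncovered elements."""
    covered, used = set(), []
    while True:
        best = max(range(len(H)),
                   key=lambda i: len((H[i] & g) - covered),
                   default=None)
        if best is None or len((H[best] & g) - covered) < min_overlap:
            break
        used.append(best)
        covered |= H[best]
    union = set().union(*(H[i] for i in used)) if used else set()
    added = g - union      # elements of g not explained by the chosen H sets
    removed = union - g    # elements the chosen H sets bring in that g lacks
    return used, added, removed

# reproduces the example above (with 0-based list indices instead of 5 and 6)
H = [{1, 2, 3}, {5, 6, 7, 8, 9, 10}]
g = {1, 2, 3, 5, 6, 7, 9, 10, 11}
print(compress(g, H))   # -> ([1, 0], {11}, {8}), i.e. G[1]' = {H[1], H[0], -8, 11}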