views:

336

answers:

5

Given a list of objects with multiple attributes I need to find the list of sets created by a union of all intersecting subsets.

Specifically these are Person objects, each with many attributes. I need to create a list of 'master' sets based on a handful of unique identifiers such as SSN, DLN, etc.

For instance, if Person A and Person B have the same SSN they create a set i. Then if Person B and C have the same DLN, they create a set ii. Person D and E have the same SSN but it (and all other identifiers) does not match any of the identifiers of Persons A, B or C. After merging all intersecting subsets I would end up with one set with Persons A,B,C and another set with Persons D, E.

Here is the psuedo-code for my solution. I am curious if anyone has already come up with a more efficient way of merging all possible intersecting sets. Keep in mind that the links between sets could be X Persons long (i.e. A matches B by SSN and B matches C by DLN and C matches D by SSN and D matches E by some other identifier would result in Persons A-E in one set). Also assume that the language this will be implemented in supports set operations.

bigSetList = array of all of the uniq Sets
fullyTested = false
while (bigSetList.size() > 1) or (fullyTested is false)
    foreach thisSet in bigSetList  order by size desc
     if count(sets that intersect with thisSet) > 0
      newThisSet = thisSet
      intersectingSets = []
      bigSetList.delete(thisSet)
      foreach testSet in bigSetList
       if thisSet.intersects(testSet)
        newThisSet.addAll(testSet)
        intersectingSets.push(testSetID)
       end if
      end
      bigSetList.delete(intersectingSets)
      bigSetList.push(newThisSet)
      bigSetList.sort()
      break
     end if
    end foreach
    fullyTested = true  // have looped through every set in the list and found 0 intersect partners
end
A: 

I would guess that you have a relatively small set of attributes for the Person object (as compared to the number of Person objects you're considering). If you want to reduce traversing the list of Person objects multiple times, you can take a Person, put its attributes into a list of known possible connections and then move on to the next Person. With each successive Person, you see if it is connected to any prior connection. If so, then you add its unique attributes to the possible connections. You should be able to process all Person objects in one pass. It's possible that you'll have some disconnected sets in the results, so it may be worth examining the unconnected Person objects after you've created the first graph.

Bernard Chen
A: 
JPS
This approach is missing a Step 6 which would then continuously try to merge every multi-collection until a pass was made where nothing merged.
Sugerman
I have now updated the explanation to show why Sugerman's comment is not needed.
JPS
A: 
while (!people.isEmpty()) {
    Person first = people.get(0);
    people.remove(first);
    Set<Person> set = makeSet(first);
    for (Person person : people) {
     for (Person other : set) {
      if (person.isRelatedTo(other)) {
       set.add(person);
       people.remove(person);
      }
     }
    }
    sets.add(set);
}
for (Set<Person> a : sets) {
    for (Set<Person> b : sets.except(a)) {
     for (Person person : a)
      for (Person other : b) {
       if (person.isRelatedTo(other)) {
        a.addAll(b);
        b.clear();
        sets.remove(b);
        break;
       }
      }
    }
}
Carl Manaster
+4  A: 

To expand on my comment in the original post, you want to create a list of sets where each member of a given set shares at least one attribute with at least one other member of that set.

Naively, this can be solved either by finding all pairs that share an attribute and merging pairs together that have the same partner iteratively. This would be O(N^3) (N^2 for iterating over pairs, and up to N separate sets to determine membership).

You can also think of this problem as determining the connected component of a graph, where every object and every unique attribute value is a node; each object would be connected to each of its attribute values. Setting up that graph would take linear time, and you could determine the connected components in linear time with a breadth or depth first search.

MSN
+1 for mentioning the linear-time connected components algorithm. http://en.wikipedia.org/wiki/Strongly_connected_component has links to the canonical choices.
Dave
Thank you very much for the information.
Sugerman
On second thought, I do not think this approach will work with an undirected graph.
Sugerman
As part of the traversal you need a way to mark nodes so that you don't visit them again. That takes care of backwards traversal.
MSN
It's also a bit of overkill since you basically have to do the same work as finding elements in a set.
MSN
A: 

First, is there some inherent hierarchy in identifiers, and do contradicting identifiers of a higher sort cancel out the same identifier of a lower sort? For example, if A and B have the same SSN, B and C have the same DLN, and C and D have the same SSN which does not match A and B's SSN, does that mean that there are two groups or one?

Assuming contradictions don't matter, you are dealing with equivalence classes, as user 57368 (unknown Google) states. For equivalence classes, people often turn to the Union-find structure. As for how to perform these unions, it's not immediately trivial because I assume you don't have the direct link A-B when both A and B have the same SSN. Instead, our sets will consist of two kinds of elements. Each (attribute type, attribute value) = attribute pair is an element. You also have elements corresponding to objects. When you iterate through the list of attributes for an object, perform the union (object, attribute).

One of the important features of the Union-find data structure is that the resulting structure represents the set. It lets you query "What set is A in?" If this is not enough, let us know and we can improve the result.

But the most important feature is that the algorithm has something which resembles constant-time behavior for each union and query operation.

Martin Hock