I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted sets, or can hashing somehow help? And what algorithm should I use to compute the set intersections? I prefer answers that relate to C/C++ (especially the STL), but any more general, algorithmic insights are also welcome.

Update: Also, note that I will be running this in parallel in a shared-memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Also, note that the vast majority of set pairs will have an intersection size of 0, meaning that it might be advantageous to use a data structure that maps user IDs to sets, so as to avoid calculating the intersection of every pair of sets.

+3  A: 

If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

  • Throw away all sets of size <15 before you start
  • Calculate your lookup from userid -> list of sets to which it belongs
  • Create a map<pair<userset, userset>, int>
  • For each user, increment (creating them first if necessary) n*(n-1)/2 entries of that map, where n is the number of sets to which the user belongs.
  • When that's finished, scan the map for entries where the value is 15 or greater (a sketch follows this list).
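
A minimal C++ sketch of these steps, assuming the userid -> sets lookup is a vector of group-ID lists (one per user, already filtered to sets of size >= 15) and that group IDs fit in 32 bits so a pair can be packed into a single 64-bit key; all names are illustrative:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// userGroups[u] holds the IDs of the (size >= 15) groups that user u belongs to.
// Returns a map from packed group-ID pair to the number of users the two groups share.
std::unordered_map<std::uint64_t, int>
countSharedUsers(const std::vector<std::vector<std::uint32_t>>& userGroups)
{
    std::unordered_map<std::uint64_t, int> pairCount;
    for (const auto& groups : userGroups) {
        // n*(n-1)/2 increments for a user who belongs to n groups
        for (std::size_t i = 0; i < groups.size(); ++i) {
            for (std::size_t j = i + 1; j < groups.size(); ++j) {
                std::uint32_t a = groups[i], b = groups[j];
                if (a > b) std::swap(a, b);
                ++pairCount[(static_cast<std::uint64_t>(a) << 32) | b];
            }
        }
    }
    return pairCount;  // finally, scan for entries with a count of 15 or more
}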

It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.

It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.
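
As a sketch of that sharding in C++ (simplified to a single merge at the end instead of periodic merges, and assuming the same user -> groups layout and packed 64-bit pair keys as in the sketch above):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using PairCounts = std::unordered_map<std::uint64_t, int>;

// Each thread counts group pairs for its own slice of users in a private map,
// so there is no contention; the shards are merged once at the end (the answer
// above suggests merging periodically to bound memory, which is omitted here).
PairCounts countSharedUsersParallel(
    const std::vector<std::vector<std::uint32_t>>& userGroups,
    unsigned numThreads)
{
    std::vector<PairCounts> shards(numThreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&shards, &userGroups, numThreads, t] {
            for (std::size_t u = t; u < userGroups.size(); u += numThreads) {
                const auto& groups = userGroups[u];
                for (std::size_t i = 0; i < groups.size(); ++i)
                    for (std::size_t j = i + 1; j < groups.size(); ++j) {
                        std::uint64_t a = groups[i], b = groups[j];
                        if (a > b) std::swap(a, b);
                        ++shards[t][(a << 32) | b];
                    }
            }
        });
    }
    for (auto& w : workers) w.join();

    PairCounts total;  // merge the per-thread shards
    for (const auto& shard : shards)
        for (const auto& kv : shard) total[kv.first] += kv.second;
    return total;
}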

Steve Jessop
I think you're on to something. It's simple, and exploits the "sparsity" of the set-intersection graph (imagine sets to be nodes, with pairs of sets connected if their intersection is > 0; we have a sparse graph here). If nobody proposes something better soon, I will accept this answer.
conradlee
@conradlee or @Steve could you please comment on my proposed solution?
Andreas Brinck
Andreas' answer might be better than this, I really don't know. If your graph really is *very* sparse then this is good, but if there are hundreds of millions of small intersections, you're going to run out of memory.
Steve Jessop
You might want to consider my answer too. It's simple, relatively memory efficient, and can be efficiently parallelized.
Philippe Beaudoin
So your solution has a complexity of O(U*g*log(a)), where U is the number of users, g is the average number of groups that users belong to, and a is the number of accepted pairs (the size of the set pair -> intersection map). I add the last log(a) factor because each increment to the set pair -> intersection map requires a look-up, which costs log(a). Philippe, do you agree that this complexity is largely better than your proposed solution? Of course, it might use too much memory...
conradlee
No, this is O(U*g^2*log(a)) (for the main step), since each user requires updating the counts for *pairs* of groups. Comparing my `g*log(a)` against his `G` requires knowing a bit about the data set, so you can't get a proper complexity comparison between the two. Plus my version has an O(a) term in there somewhere - you have to actually create the map nodes. a is bounded by U, g and G, though, it's not a fully independent term.
Steve Jessop
Possibly you should ditch the log(a) term, actually, by using a hashmap instead of a map, either storing each set's hash value or hashing using its address and pointers for everything. Nobody has said anything about language except that I used C++ notation to avoid saying "map from (pairs of usersets) to (a counter)". So a map doesn't *have* to be a BST.
Steve Jessop
What do you think of Pete Kirkham's last idea below? It seems to address the problem of running this in parallel (I'll run this on a machine with 24 cores and 128gb ram, btw.) It's basically like your suggestion, but instead of incrementing a map each time you find a pair of sets that exist in some user's membership container, you add that pair to a big container of pairs of sets. Then you can sort this container (which is a straightforward parallel operation), and count up how many of each pair exists in the sorted container.
conradlee
Mine relies on fiddly implementation details like the memory allocator used for the map nodes behaving well multi-threaded, and the custom code you write to merge the map shards. But his also relies on several specific optimisations, which are good but introduce algorithmic fiddliness :-)
Steve Jessop
I was assuming a hashmap in my implementation too, although with a BST I'd get an extra `log(Ug)` term which is probably negligible.
Philippe Beaudoin
A: 

One way is to see your problem as a metric space radius search, where the distance function is the number of non-matching entries and the search radius is r = max(number of elements in the sets) - (number of equal elements required). Filtering of the found elements is then needed to check that enough values really are shared. So unless someone comes up with a metric function that can be used directly, this solution has too many constraints.
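
For concreteness, a minimal sketch of one possible distance function, reading "number of non-matching entries" as the size of the symmetric difference of two sorted groups (an assumption; the exact metric is debated in the comments below):

#include <algorithm>
#include <iterator>
#include <vector>

// Distance between two sorted groups = size of their symmetric difference,
// i.e. the number of user IDs that are in one group but not the other.
int setDistance(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> diff;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(diff));
    return static_cast<int>(diff.size());
}
// For two 20-element groups, an intersection of at least 15 means at most
// 5 unmatched elements on each side, i.e. a distance of at most 10.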

One of the data structures for metric search is the BK-tree, which can be used for string similarity searches.

Candidates for your problem are VP-trees and M-trees.

The worst case for a metric tree is O(n^2) when you're searching with a distance > m (the maximum number of elements in the sets), since you build up the tree in O(n log n) and search in O(n^2).

Apart from that, the actual runtime complexity depends on the ability to prune subtrees of the metric tree while executing the search. In a metric tree, a subtree can be skipped if the distance from the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance from the pivot to the elements in its subtree). If your input sets are mostly disjoint, the overall runtime will be dominated by the build-up time of the metric tree, O(n log n).

Thomas Jung
Can you give any estimate of the complexity of a solution based on this idea? How does it compare to Philippe Beaudoin's solution? Also, given that I don't want to implement any of these trees, are there any good implementations I could use (preferably in C++)?
conradlee
@conradlee - See edit.
Thomas Jung
What's the metric which defines this metric space? Your stated distance function does not define a metric space. The required properties are `d(x,x) == 0`, `d(x,y) == d(y,x)`, and (trickiest) `d(x,z) <= d(x,y) + d(y,z)`.
Steve Jessop
@Steve You're right. The distance function is the number of differing elements between two sets.
Thomas Jung
@Steve The problem is that you're searching with a radius r = max(number of elements in the sets) - number of equal elements needed.
Thomas Jung
That is a metric, but then how do you use the metric to focus on sets which have (at least) 15 elements in common? The sets `{1 ... 15, 16}` and `{1 .. 15, 17}` are at distance 1, and have 15 elements in common. The sets `{1 .. 15}` and `{1 .. 14, 16}` are at distance 1 and don't have 15 elements in common. The sets `{1 .. 100}` and `{1 .. 50, 101 .. 150}` are at distance 50 and have 15 elements in common. And that's before worrying about sets of different sizes.
Steve Jessop
@Steve That's obvious. That's why a filtering step is needed to be correct.
Thomas Jung
@Steve Unless there is an appropriate metric function to answer it directly, the metric tree is useless.
Thomas Jung
Sorry, I started writing my comment "that is a metric but...", before I saw your comment "the problem is that...". Anyway, that is my fear, that Hamming distance isn't a helpful metric here.
Steve Jessop
+2  A: 

I would do exactly what you propose: map users to their groups. That is, I would keep a list of group IDs for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, this will run in O(G*U*g). For your problem, that is probably much faster than the naive pairwise comparison of groups, which runs in O(G*G*U).
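
A C++ rendering of this pseudocode as a sketch, assuming groups and the user->group index are stored as vectors of 32-bit IDs; the container layout and the `report` callback are illustrative:

#include <cstdint>
#include <unordered_map>
#include <vector>

// groupMembers[g] = user IDs in group g; userGroups[u] = group IDs user u is in.
// report(g1, g2) is a hypothetical callback invoked once per pair of groups
// that share at least 15 users.
template <typename Report>
void findLargeIntersections(
    const std::vector<std::vector<std::uint32_t>>& groupMembers,
    const std::vector<std::vector<std::uint32_t>>& userGroups,
    Report report)
{
    for (std::uint32_t g = 0; g < groupMembers.size(); ++g) {
        std::unordered_map<std::uint32_t, int> count;  // other group -> shared users so far
        for (std::uint32_t user : groupMembers[g]) {
            for (std::uint32_t other : userGroups[user]) {
                if (++count[other] == 15 && other > g)
                    report(g, other);  // the id check makes this fire exactly once per pair
            }
        }
    }
}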

Philippe Beaudoin
Can't you fix the "called multiple times" issue by changing `>15` to `==15`? Then you'd be calling it exactly twice for each result (plus once for each big enough set with itself, which probably isn't desired). That you can fix by using some ordering of groups to do `foreach userGroup in user.groups where userGroup > group:`.
Steve Jessop
@Steve Good idea! Then you'll still get it exactly twice: once with (group, userGroup) and once with (userGroup, group).
Philippe Beaudoin
You can remedy this last problem by only calling largeIntersection( group, userGroup) if group > userGroup. This also fixes another "bug" in your code: you should not be keeping track of how much one group intersects with itself!
conradlee
@conradlee. Thanks, I've updated it.
Philippe Beaudoin
+1  A: 

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)

In order to count the degree of intersection, you still need to visit the other groups the user has, which is still cubic. You could have a hash table or other sparse array to count, but it would still, at best, require an increment for every pair of groups each user is in. If you have N users in G groups, with an average of S users per group and each user in an average of T groups, you've got G*G*S/2 for comparing each pair of groups and N*T*T if you have an index from user to group. T = G*S/N, so N*T*T = G*G*S*S/N; for S=20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection counts (25 TB or so for a 4-bit non-sparse counter), and you have to ensure the structure can be incremented in parallel.

For a million users in 10 million groups of 20, very approximately the probability that a given user is in a given group is 2e-5, and the probability that two groups share at least one user is around 4e-4, so the 25 TB comes down to roughly 20 billion non-empty counters. That is still tens of gigabytes for a sparse structure, but no longer impossible on a machine with plenty of RAM.

However, comparing sets of 20 elements for 15 elements in common has more obvious optimisations:

  • If the groups are sorted, you require no working storage: just walk the two lists in step and output the degree of difference between the input groups directly (see the sketch after this list).
  • Most of the memory accesses will be linear in contiguous memory areas, and the results are dependent only on the two sets being compared, rather than relying on summing across the whole data set. Accessing main memory randomly is significantly slower than accessing it linearly. Altering main memory randomly with bus locks is orders of magnitude slower than accessing cache without the need to lock the bus (though if you have a couple of GB per core you can use the user->group approach without having to do any synchronisation).
  • You only need to count until more than 5 elements of either set are found to be missing from the other; if the data is random then most sets will be disjoint, so the average number of elements visited is small.
  • You can quickly discount certain groups by treating the difference as a distance: if A differs from B in 11 elements, and C differs from B in 5, then C differs from A by between 6 and 16 elements, so it can be discounted without comparing A and C directly. Since most of the sets are completely disjoint, this won't gain you much, though.
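
A sketch of that sorted comparison with the early exit, assuming each group is a sorted vector of exactly 20 user IDs:

#include <cstddef>
#include <vector>

// Linear merge over two sorted 20-element groups. Gives up as soon as more
// than 5 elements of either group are known to be missing from the other,
// since the intersection can then no longer reach 15.
bool sharesAtLeast15(const std::vector<int>& a, const std::vector<int>& b)
{
    std::size_t i = 0, j = 0;
    int common = 0, missingA = 0, missingB = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])      { ++common; ++i; ++j; }
        else if (a[i] < b[j])  { if (++missingA > 5) return false; ++i; }
        else                   { if (++missingB > 5) return false; ++j; }
    }
    return common >= 15;
}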

There is also the option of a hybrid approach, using the user->group map to prune the set of group-to-group comparisons to be made. This has the advantage of not requiring increments of a shared data structure (a sketch follows the list below):

  • for each pair of groups that a user is in, add that pair to the list to investigate.
  • sort the list of pairs of groups with at least one user in common.
  • the number of times each pair occurs in the list is the number of users they have in common.
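
A single-threaded sketch of this hybrid, packing each group-ID pair into a 64-bit key (group IDs assumed to fit in 32 bits); in practice the roughly 20 billion keys would be generated in chunks and fed through a parallel or external sort:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Emit one packed (smaller ID, larger ID) key for every pair of groups that
// shares a user, sort the keys, then count equal runs; the run length is the
// size of that pair's intersection.
std::vector<std::pair<std::uint64_t, int>>
pairsWithLargeIntersection(const std::vector<std::vector<std::uint32_t>>& userGroups)
{
    std::vector<std::uint64_t> pairs;
    for (const auto& groups : userGroups)
        for (std::size_t i = 0; i < groups.size(); ++i)
            for (std::size_t j = i + 1; j < groups.size(); ++j) {
                std::uint64_t a = groups[i], b = groups[j];
                if (a > b) std::swap(a, b);
                pairs.push_back((a << 32) | b);
            }

    std::sort(pairs.begin(), pairs.end());  // parallel or external sort in practice

    std::vector<std::pair<std::uint64_t, int>> result;  // pairs sharing >= 15 users
    for (std::size_t i = 0; i < pairs.size(); ) {
        std::size_t j = i;
        while (j < pairs.size() && pairs[j] == pairs[i]) ++j;
        if (j - i >= 15) result.emplace_back(pairs[i], static_cast<int>(j - i));
        i = j;
    }
    return result;
}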

Using merge sort, this is very easy to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each of the 10 million groups of 20 users, multiplied by the roughly 200 groups each user is in, divided by 2).

Pete Kirkham