views:

515

answers:

3

I have a dataset of ca. 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection of all possible 2-list combinations in this dataset. Note that elements in each list are unique so they can be stored as sets in python.

What is the most efficient way to perform this in python?

Edit I forgot to specify that I need to have the ability to match the intersection values to the corresponding pair of lists. Thanks everybody for the prompt response and apologies for the confusion!

+2  A: 

As you need to produce a (N by N/2) matrix of results, i.e., O(N squared) outputs, no approach can be less than O(N squared) -- in any language, of course. (N is "about 9K" in your question). So, I see nothing intrinsically faster than (a) making the N sets you need, and (b) iterating over them to produce the output -- i.e., the simplest approach. IOW:

def lotsofintersections(manylists):
  manysets = [set(x) for x in manylists]
  moresets = list(manysets)
  for  s in reversed(manysets):
    moresets.pop()
    for z in moresets:
      yield s & z

This code's already trying to add some minor optimization (e.g. by avoiding slicing or popping off the front of lists, which might add other O(N squared) factors).

If you have many cores and/or nodes available and are looking for parallel algorithms, it's a different case of course -- if that's your case, can you mention the kind of cluster you have, its size, how nodes and cores can best communicate, and so forth?

Edit: as the OP has casually mentioned in a comment (!) that they actually need the numbers of the sets being intersected (really, why omit such crucial parts of the specs?! at least edit the question to clarify them...), this would only require changing this to:

  L = len(manysets)
  for i, s in enumerate(reversed(manysets)):
    moresets.pop()
    for j, z in enumerate(moresets):
      yield L - i, j + 1, s & z

(if you need to "count from 1" for the progressive identifiers -- otherwise obvious change).

But if that's part of the specs you might as well use simpler code -- forget moresets, and:

  L = len(manysets)
  for i xrange(L):
    s = manysets[i]
    for j in range(i+1, L):
      yield i, j, s & manysets[z]

this time assuming you want to "count from 0" instead, just for variety;-)

Alex Martelli
@Alex Martelli: Would you please explain more about why this code is faster than itertools.combination? My timeit tests suggests this is the case, but I find that disturbing since "the one obvious way to do it" (itertools.combination) does not appear to be the fastest way to do it in this case. Should be always be using your code instead of combinations?
unutbu
Thanks Alex, I am not planning to run this on a cluster so it might be quite demanding indeed. I too am interested in hearing about known differences in performance between different methods.
radrat
RedGlyph
@unutbu: I've done a few measurements and both yield very close timings (within 0.1% for minute-long tests), regardless of the list and/or set sizes. Did you observe something much different?
RedGlyph
unutbu
@RedGlyph: When I fix the error, it appears itertools.combinations is perhaps faster than reversepop: http://paste.ubuntu.com/321928/.`using combinations: 23.222 secs``using reversepop : 26.909 secs`
unutbu
@unutbu: OK, I see how that would explain the difference. It would probably depend a little on the implementation too (I'm using CPython 2.6.4 on Win32), but I've also noticed `itertools.combinations` tended to use less memory, although I didn't check how it was actually coded to find a formal explanation. Thanks for double-checking it :-)
RedGlyph
@RedGlyph, I just edited the answer to avoid "intersections with itself" (which literally may break the specs in the Q, which don't specify 2-**distinct**-list combinations, but rather **all** possible ones, but is probably what the OP meant). I'm not surprised if itertools is nevertheless still slightly faster, since it's coded so well and so fast (the extra memory use, less than 36 KB extra for my case, shd be trivial as a fraction of memory needed anyway, so I don't worry about that;-). But: I often use App Engine, so 2.5-compatible solutions are still very interesting to me at least;-).
Alex Martelli
@Alex: I hesitated on the interpretation because of the emphasis on _all possible_, but just assumed it wouldn't make much sense, yet I may be wrong. And yes, `combinations()` is new since 2.6, you are right! I should have mentioned that.
RedGlyph
+2  A: 

If your sets are stored in s, for example:

s = [set([1, 2]), set([1, 3]), set([1, 2, 3]), set([2, 4])]

Then you can use itertools.combinations to take them two by two, and calculate the intersection (note that, as Alex pointed out, combinations is only available since version 2.6). Here with a list comrehension (just for the sake of the example):

from itertools import combinations
[ i[0] & i[1] for i in combinations(s,2) ]

Or, in a loop, which is probably what you need:

for i in combinations(s, 2):
    inter = i[0] & i[1]
    # processes the intersection set result "inter"

So, to have the length of each one of them, that "processing" would be:

    l = len(inter)

This would be quite efficient, since it's using iterators to compute every combinations, and does not prepare all of them in advance.


Edit: Note that with this method, each set in the list "s" can actually be something else that returns a set, like a generator. The list itself could simply be a generator if you are short on memory. It could be much slower though, depending on how you generate these elements, but you wouldn't need to have the whole list of sets in memory at the same time (not that it should be a problem in your case).

For example, if each set is made from a function gen:

def gen(parameter):
    while more_sets():
        # ... some code to generate the next set 'x'
        yield x

with open("results", "wt") as f_results:
    for i in combinations(gen("data"), 2):
        inter = i[0] & i[1]
        f_results.write("%d\n" % len(inter))


Edit 2: How to collect indices (following redrat's comment).

Besides the quick solution I answered in comment, a more efficient way to collect the set indices would be to have a list of (index, set) instead of a list of set.

Example with new format:

s = [(0, set([1, 2])), (1, set([1, 3])), (2, set([1, 2, 3]))]

If you are building this list to calculate the combinations anyway, it should be simple to adapt to your new requirements. The main loop becomes:

with open("results", "wt") as f_results:
    for i in combinations(s, 2):
        inter = i[0][1] & i[1][1]
        f_results.write("length of %d & %d: %d\n" % (i[0][0],i[1][0],len(inter))

In the loop, i[0] and i[1] would be a tuple (index, set), so i[0][1] is the first set, i[0][0] its index.

RedGlyph
> depending on how you generate these elementsThe sets are imported from flat text files and cannot be generated on the fly unfortunately.
radrat
They _could_ be but that would be very slow (jumping at cached positions, reading them... would indeed be a bit daunting). In this case I would stick to the first method, which does not increase the memory consumption anyway - so long as you can hold your original list of sets in memory you are fine. The _Edit_ part idea was more to show the flexibility.
RedGlyph
@RedGlyph, using combinations and the above method I can obtain the intersections between any two items and the length of the intersection indeed. However I understand I would not be able to access the position in s of the two items being compared, right?My final output needs to be a series of triples containing the ids of the two sets (e.g. if they are defined in a dictionary, or at least their positions in s) and the length of their intersection. Can your solution based on itertools be used for this purpose?
radrat
RedGlyph
@redrat, this (http://paste.ubuntu.com/322146/) is a continuation of RedGlyph's idea, except using enumerate to collect the indices. Since `enumerate` returns an iterator, it can be daisy-chained to itertools.combinations nicely. I haven't tested it, but I think this is faster than accessing elements of `s` via indexing: `s[x]`.
unutbu
@redrat, I've added an example with indices which is more efficient than my last answer (it was a bit late...). Make sure to also check unutbu's code in the previous comment, it's more elegant IMO :-)
RedGlyph
thanks guys, I'll give a try to all of the above options and report back. (oh and how unfortunate the choice of "radrat" as a nickname...)
radrat
A: 

Hi,

Try this:

_lists = [[1, 2, 3, 7], [1, 3], [1, 2, 3], [1, 3, 4, 7]]
_sets = map( set, _lists )
_intersection = reduce( set.intersection, _sets )

And to obtain the indexes:

_idxs = [ map(_i.index, _intersection ) for _i in _lists ]

Cheers,

José María García

PS: Sorry I misunderstood the question

Jose M.