views: 534
answers: 6

Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.

After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).

My problem is with the search() function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.

Example:

postings = [[1,   100, 142, 322, 12312],
            [2,   100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]

Should yield:

[100, 322]

There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools abuse, or any other kind of code golf is more than welcome. :-)

It should be possible to arrange for the function to require only as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In the future, these lists may be read from disk, and may be larger than available RAM.

For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!

+6  A: 
def postings(posts):
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))

... you could try to take advantage of the fact that the lists are ordered, but since reduce, generator expressions and set are all implemented in C, you'll probably have a hard time doing better than the above with logic implemented in pure Python.
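On Python 3, where reduce lives in functools, an equivalent sketch (the postings_py3 name is just for illustration):

```python
from functools import reduce

def postings_py3(posts):
    # Build each set lazily; reduce intersects them pairwise, so at any
    # moment only one full list plus the shrinking running intersection
    # is held in memory.
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))

postings = [[1,   100, 142, 322, 12312],
            [2,   100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]
print(postings_py3(postings))  # [100, 322]
```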

Aaron Maenpaa
Nice! Although, this duplicates the entirety of the posting lists, simply to perform the match. It should be possible to do this without resorting to a hash table, or a large copy.
David Wilson
Actually, it doesn't duplicate the entire postings list. sets is a generator, which will yield each set as needed, but never the whole thing at once.
Triptych
Very nice. So the memory overhead would only be the size of a single posting list.
David Wilson
@dmwmd, that's correct. More accurately the average case memory overhead would be O(n+i) where n is the average size of a list and i is the average size of the intersection.
Triptych
The question is whether the clever code in python will be faster than the copy-based code in C.
Douglas Leeder
+1: on the example provided, it is ~ 3 times faster than my dumb algo :)
NicDumZ
Sorry for -1, but I don't think anything set-based would work if the iterators aren't just simple arrays, e.g. the iterators transmit 10GB of data each from 10 network servers as real-time results of complicated high-energy physics experiments. There is a more algorithmic solution below that doesn't store the data.
ilya n.
@ilya I see where you're coming from, and it won't work with infinite generators either; however, there's a lot to be said for something that's three lines long. After all, not everybody is doing high-energy physics experiments (besides, even if you were, you could easily have a machine with 64GiB of RAM; it's not that expensive).
Aaron Maenpaa
+3  A: 

If these are really long (or even infinite) sequences, and you don't want to load everything into a set in advance, you can implement this with a 1-item lookahead on each iterator.

EndOfIter = object() # Sentinel value

class PeekableIterator(object):
    def __init__(self, it):
        self.it = it
        self._peek = None
        self.next() # pump iterator to get first value

    def __iter__(self): return self

    def next(self):
        cur = self._peek
        if cur is EndOfIter:
            raise StopIteration()

        try:
            self._peek = self.it.next()
        except StopIteration:
            self._peek = EndOfIter
        return cur

    def peek(self): 
        return self._peek


def contained_in_all(seqs):
    if not seqs: return   # No items
    iterators = [PeekableIterator(iter(seq)) for seq in seqs]
    first, rest = iterators[0], iterators[1:]

    for item in first:
        candidates = list(rest)
        while candidates:
            if any(c.peek() is EndOfIter for c in candidates): return  # Exhausted an iterator
            candidates = [c for c in candidates if c.peek() < item]
            for c in candidates: c.next()

        # We leave the loop once the first item of every remaining iterator is >= item.
        if all(it.peek() == item for it in rest):
            yield item

Usage:

>>> print list(contained_in_all(postings))
[100, 322]
Brian
+1: very elegant, thank you.
NicDumZ
And it's of course much, much, more efficient than other approaches.
NicDumZ
but for completeness, you might want to check that iterators[0] does exist :)
NicDumZ
Good point - updated to check seqs is not empty.
Brian
This is wonderful, and even seems to work. :) In the meantime I wrote a 'recursive' version, which looks more concise, but probably at the cost of CPU.
David Wilson
I think this solution will run unnecessarily long on input like [[1000000], range(1000000), [1]], where it will run through and exhaust range(1000000) before ever examining the [1] sequence.
Martin Geisler
(I posted a solution that avoids this below.)
Martin Geisler
it might be interesting to sort the iterators according to the list lengths: seqs = sorted(seqs, cmp=lambda x, y:len(x).__cmp__(len(y)))
NicDumZ
@Martin - you're right, I've switched to a breadth first approach to avoid this, so it should now be proportional to the shortest sequence rather than the longest in most cases.
Brian
@NicDumz: Sorting on length means you need more than just iterators, so you lose the ability to deal with non-list sequences - potentially the sequence consumed could even be infinite in length.
Brian
+1  A: 

What about this:

import heapq

def inalliters(iterators):
  heap = [(iterator.next(), iterator) for iterator in iterators]
  heapq.heapify(heap)
  maximal = max(heap)[0]
  while True:
    value, iterator = heapq.heappop(heap)
    if maximal == value: yield value
    nextvalue = iterator.next()
    heapq.heappush(heap, (nextvalue, iterator))
    maximal = max(maximal, nextvalue)

postings = [iter([1,   100, 142, 322, 12312]),
            iter([2,   100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]
print [x for x in inalliters(postings)]

I haven't tested it very thoroughly (just ran your example), but I believe the basic idea is sound.
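On Python 3 the same idea needs two adjustments: iterators aren't orderable, so ties in the heap need a counter to break them, and PEP 479 means the generator must catch StopIteration itself rather than let it leak. A sketch along those lines (inalliters_py3 is an illustrative name):

```python
import heapq
from itertools import count

def inalliters_py3(iterators):
    # Pair each head value with a unique counter so the heap never has
    # to compare two iterators when values tie.
    tiebreak = count()
    heap = [(next(it), next(tiebreak), it) for it in iterators]
    heapq.heapify(heap)
    maximal = max(heap)[0]
    while True:
        value, _, iterator = heapq.heappop(heap)
        if value == maximal:
            yield value
        try:
            nextvalue = next(iterator)
        except StopIteration:
            return  # one sequence is exhausted: no further common values
        heapq.heappush(heap, (nextvalue, next(tiebreak), iterator))
        maximal = max(maximal, nextvalue)

postings = [iter([1,   100, 142, 322, 12312]),
            iter([2,   100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]
print(list(inalliters_py3(postings)))  # [100, 322]
```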

J S
+6  A: 

This solution will compute the intersection of your iterators. It works by advancing the iterators one step at a time and looking for the same value in all of them. When found, such values are yielded -- this makes the intersect function a generator itself.

import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1,   100, 142, 322, 12312],
    ...                 [2,   100, 101, 322, 1221],
    ...                 [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]

        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()
Martin Geisler
Your solution is what I wanted to write, if only I knew Python well enough...
ilya n.
Nice. You could replace the reduce with all() instead - you'll also get short circuiting that way too.
Brian
@Brian: true, but all is not in Python 2.4 which is the version I normally target :-)
Martin Geisler
The only tiny improvement I could find to make is doing "range(len(iterators)-1)", and not slicing indices later. Otherwise, this solution rocks. :) Thanks.
David Wilson
@dmwmd: yeah, I was debating this myself and you're right that it's probably better like that.
Martin Geisler
+1  A: 

I want to show that there's an elegant solution which only iterates forward once. Sorry, I don't know Python well enough, so I'm using fictional classes. This one reads input, an array of iterators, and writes to output on the fly, without ever going back or using any array function.

    def intersect (input, output) 
        do:
            min = input[0]
            max = input[0]
            for i in input:
                if (i.cur < min.cur):
                     min = i
                if (i.cur > max.cur):
                     max = i
            if (min.cur == max.cur): 
                output.push(min.cur)
        while (min.step())
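A runnable Python sketch of this idea, using a small invented Cursor wrapper to supply the .cur/.step() interface; the all-equal test is expressed as minimum == maximum:

```python
class Cursor(object):
    """Tiny wrapper giving an iterator the .cur / .step() interface
    assumed by the pseudocode; the name is invented here."""
    def __init__(self, it):
        self.it = iter(it)
        self.cur = next(self.it)

    def step(self):
        try:
            self.cur = next(self.it)
            return True
        except StopIteration:
            return False

def intersect(inputs):
    cursors = [Cursor(seq) for seq in inputs]
    while True:
        lowest = highest = cursors[0]
        for c in cursors:
            if c.cur < lowest.cur:
                lowest = c
            if c.cur > highest.cur:
                highest = c
        if lowest.cur == highest.cur:
            # Every cursor holds the same value: it is in all sequences.
            yield lowest.cur
        if not lowest.step():
            return  # smallest sequence exhausted: nothing more in common

print(list(intersect([[1, 100, 142, 322, 12312],
                      [2, 100, 101, 322, 1221],
                      [100, 142, 322, 956, 1222]])))  # [100, 322]
```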
ilya n.
This is nice -- I wrote a solution above which does essentially this. I use a list to store the last value seen for each iterator, since iterators don't have a .cur attribute like the one you use. But apart from that the solutions are almost identical.
Martin Geisler
+7  A: 
import heapq, itertools

def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key

>>> list(intersect(*postings))
[100, 322]
Coady
Awesome! I knew this must have been in the standard library. Sadly only for Python 2.6, but that's OK.
David Wilson
Wow, cool solution!
Martin Geisler
Nice solution, although it assumes that integers are never repeated within a single iterator, which is not an assumption the OP allowed. posting = [[100, 100], [1, 1]] returns [100, 1] even though no values are repeated across lists.
Triptych
Fair enough, but it is common to assume 'ascending' means strictly. Best to say 'monotonically ascending' if that's what the OP meant.
Coady
Or "nondescending". My reading of the question was also that the iterators were intended to produce strictly ascending results.
ephemient