views:

46

answers:

0

I want to iterate over the intersection of n sub-iterators. Each sub-iterator is sorted from GREATEST TO LEAST, and the values must be accessed sequentially. Assume no duplicate values.

So far I have thought of two approaches using a heap/priority queue, but I'm struggling to figure out which is more efficient - it seems like it may be data dependent:

========== APPROACH 1

// Return false when there are no more intersecting values
// This method is on the top-level intersection iter, not the sub-iter
// _heap is sorted from **GREATEST TO LEAST**
bool next()
{
    while (true) {

        // _heap is sorted from greatest to least
        _currentSubIter = _heap.pop()
        _currentValue = _currentSubIter.getValue()

        // Push the current iter back on the heap unless it is empty
        if (_currentSubIter.next()) {
            _heap.push(_currentSubIter)
        }

        int numMatches = 1

        // Loop until next interseting match is found
        while (!numMatches < _subIterCount &&
               _currentValue == _heap.top().getValue()) {

            // pop top sub-iter and increment
            // push sub-iter back on the heap unless it is empty

            ++numMatches
        }

        if (_numMatches == _subIterCount) {
            // Found the same value in all sub-iters 
            return true 
        } else if (_heap.size() < _numSubIters) {
            // One or more sub-iters have been exhausted
            // so there are no more possible intersections
            return false
        }

        // Didn't match, try again
    }
}

========== APPROACH 2

// Return false when there are no more intersecting values
// This method is on the top level intersection iter, not the sub-iter
// _heap is sorted from **LEAST TO GREATEST**
bool next()
{
    // track whether we have exhausted a sub-iter
    bool exhaustedIter = false

    while (true) {
        _currentSubIter = _heap.top()
        _currentValue = _currentSubIter.getValue()

        if (!_currentSubIter.next()) {
            exhaustedIter = true
        }

        // count matching values
        int numMatches = 1

        // Iterate through remaining sub-iters in the heap
        foreach subIter (_heap) {
            // Skip all values greater than current value
            while ((subIter.getValue() > _currentValue) && 
                   subIter.next()) { }

            if (subIter.end()) {
                // No more intersections
                exhaustedIter = true
                break
            } else if (subIter.getValue() < _currentValue) {
                break
            } else if (subIter.getValue() == _currentValue) {
                ++numMatches
                if (!subIter.next()) {
                    exhaustedIter = true
                }
            }
        }

        if (numMatches == _numSubIters) {
            // Number matched in all sub iters
            if (!exhuastedIter) {
                remakeHeap(_heap)
            }
            return true
        } else if (exhaustedIter) {
            return false
        }

        // Didn't match, try again
        remakeHeap(_heap)
    }
}

My data sets could be highly variable but assume about 10 or less sub-iterators and long lists in each sub-iterator. The number of actual intersections could be very small in some cases, which makes me lean towards approach #2 to avoid constant pushing/popping on the heap... it does (however) use a build-heap operation once in awhile (less efficient than fix-heap) so I'm not sure which is going to be more efficient.

Any insight would be much appreciated :)

I should add that I can't pull the data into memory first. I must use the iterators because the datasets will potentially be too large to fit in memory.

Here are some data sets to consider, each with three sub-iterators. Remember that the heap is always built based on the top value in each sub-iterator (so in the first example the heap is initially based on 1000, 1000, and 10:

DATASET 1

1000     1000      10
999      999       9
.        .         .
.        .         .
.        .         .
0        0         0


DATASET 2

9        8         7
6        5         4
3        2         1


DATASET 3

3        3         3
2        2         1
1        1