I want to iterate over the intersection of n sub-iterators. Each sub-iterator is sorted from GREATEST TO LEAST, and the values must be accessed sequentially. Assume no duplicate values.
So far I have thought of two approaches using a heap/priority queue, but I'm struggling to figure out which is more efficient - it seems like it may be data dependent:
========== APPROACH 1
// Return false when there are no more intersecting values
// This method is on the top-level intersection iter, not the sub-iter
// _heap is sorted from **GREATEST TO LEAST**
bool next()
{
while (true) {
// _heap is sorted from greatest to least
_currentSubIter = _heap.pop()
_currentValue = _currentSubIter.getValue()
// Push the current iter back on the heap unless it is empty
if (_currentSubIter.next()) {
_heap.push(_currentSubIter)
}
int numMatches = 1
// Loop until next interseting match is found
while (!numMatches < _subIterCount &&
_currentValue == _heap.top().getValue()) {
// pop top sub-iter and increment
// push sub-iter back on the heap unless it is empty
++numMatches
}
if (_numMatches == _subIterCount) {
// Found the same value in all sub-iters
return true
} else if (_heap.size() < _numSubIters) {
// One or more sub-iters have been exhausted
// so there are no more possible intersections
return false
}
// Didn't match, try again
}
}
========== APPROACH 2
// Return false when there are no more intersecting values
// This method is on the top level intersection iter, not the sub-iter
// _heap is sorted from **LEAST TO GREATEST**
bool next()
{
// track whether we have exhausted a sub-iter
bool exhaustedIter = false
while (true) {
_currentSubIter = _heap.top()
_currentValue = _currentSubIter.getValue()
if (!_currentSubIter.next()) {
exhaustedIter = true
}
// count matching values
int numMatches = 1
// Iterate through remaining sub-iters in the heap
foreach subIter (_heap) {
// Skip all values greater than current value
while ((subIter.getValue() > _currentValue) &&
subIter.next()) { }
if (subIter.end()) {
// No more intersections
exhaustedIter = true
break
} else if (subIter.getValue() < _currentValue) {
break
} else if (subIter.getValue() == _currentValue) {
++numMatches
if (!subIter.next()) {
exhaustedIter = true
}
}
}
if (numMatches == _numSubIters) {
// Number matched in all sub iters
if (!exhuastedIter) {
remakeHeap(_heap)
}
return true
} else if (exhaustedIter) {
return false
}
// Didn't match, try again
remakeHeap(_heap)
}
}
My data sets could be highly variable but assume about 10 or less sub-iterators and long lists in each sub-iterator. The number of actual intersections could be very small in some cases, which makes me lean towards approach #2 to avoid constant pushing/popping on the heap... it does (however) use a build-heap operation once in awhile (less efficient than fix-heap) so I'm not sure which is going to be more efficient.
Any insight would be much appreciated :)
I should add that I can't pull the data into memory first. I must use the iterators because the datasets will potentially be too large to fit in memory.
Here are some data sets to consider, each with three sub-iterators. Remember that the heap is always built based on the top value in each sub-iterator (so in the first example the heap is initially based on 1000, 1000, and 10:
DATASET 1
1000 1000 10
999 999 9
. . .
. . .
. . .
0 0 0
DATASET 2
9 8 7
6 5 4
3 2 1
DATASET 3
3 3 3
2 2 1
1 1