views:

56

answers:

3

I spent a considerable amount of time coding in Baeza-Yates' fast set intersection algorithm for one of my apps. While I did marginally out-do the STL set_intersect, the fact that I required the resultant set to be sorted removed any time I had gained from implementing my own algorithm after I sorted the output. Given that the STL set_intersect performs this well, can anyone point me to the algorithm that it actually implements? Or does it implement the same Baeza-Yates algorithm but only in a much more efficient manner?

Baeza-Yates: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.7899&rep=rep1&type=pdf

+3  A: 

STL doesn't require any particular algorithm, it just sets constraints on the algorithmic complexity of certain operations. Since it's all template based, you can easily view the source to your particular implementation to see how it works.

Mark Ransom
+1. I was starting to write an answer, and copied this, but your answer looks quite good, so I'll paste it here instead. The specific requirement is: "Complexity: At most 2 * ((last1 - first1) + (last2 - first2)) - 1 comparisons."
Jerry Coffin
Easily view it, sure. Understanding how it works before your head explodes? Now that's a different story.
PigBen
@PigBen, if you're curious about these kinds of things I assume your head explosion threshold is a bit higher than most. My own threshold is a lot lower than I'd like.
Mark Ransom
+1  A: 

At least in the implementations I've looked at, the implementation is fairly simplistic -- something on this general order:

template <class inIt, class outIt>
outIt set_intersection(inIt start1, inIt end1, inIt start2, inIt end2, outIt out) {
    while (start1 != end1 && start2 != end2) {
       if (*start1 < *start2)
           ++start1;
       else if (*start2 < *start1)
           ++start2;
       else {                 // equal elements.
           *out++ = *start1;
           ++start1;
           ++start2;
       }
    }
    return out;
}

Of course, I'm just typing this off the top of my head -- it probably won't even compile, and certainly isn't pedantically correct (e.g., should probably use a comparator function instead of using operator< directly, and should have another template parameter to allow start1/end1 to be a different type from start2/end2).

From an algorithmic viewpoint, however, I'd guess most real implementations are pretty much as above.

Jerry Coffin
A: 

Interesting. So, the number of comparisons in your algorithm linearly scales with the number of elements in both sets. The Baeza-Yates algorithm goes something like this (note that it assumes both input sets are sorted) :

1) Find the median of set A (A is the smaller set here) 2) Search for the median of A in B. If found, add to the result else, the insertion rank of the median in B is known. 3) Split set A about its median into two parts, and set B about its insertion rank into two parts, and repeat the procedure recursively on both parts. This step works because all elements less than the median in A would intersect only with those elements before the insertion rank of A's median in B.

Since you can use a binary search to locate A's median in B, clearly, the number of comparisons in the this algorithm is lower than the one you mentioned. In fact, in the "best" case, the number of comparisons is O(log(m) * log(n)), where m and n are the sizes of the sets, and in the worst case, the number of comparisons is O(m + n). How on earth did I mess up the implementation this bad? :(

darthshak