How do I compute the intersection and union of sets of type tr1::unordered_set in C++? I can't find much reference material about it.

Any reference and code will be highly appreciated. Thank you very much.

Update: I assumed tr1::unordered_set would provide functions for intersection, union, and difference, since those are the basic set operations. Of course I can write a function myself, but I wonder whether TR1 has built-in functions for this. Thank you very much.

+3  A: 

There's nothing much to it: for intersection, go through every element of one set and keep it if it's also in the other. For union, insert all items from both input sets.

For example:

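// Writes the intersection of in1 and in2 into out.
// out must be a different object from in1 and in2, since it is cleared first.
void us_isect(std::tr1::unordered_set<int> &out,
        const std::tr1::unordered_set<int> &in1,
        const std::tr1::unordered_set<int> &in2)
{
    out.clear();
    // Always iterate over the smaller set and probe the larger one.
    if (in2.size() < in1.size()) {
        us_isect(out, in2, in1);
        return;
    }
    for (std::tr1::unordered_set<int>::const_iterator it = in1.begin(); it != in1.end(); ++it)
    {
        if (in2.find(*it) != in2.end())
            out.insert(*it);
    }
}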

// Writes the union of in1 and in2 into out (out must not alias either input).
void us_union(std::tr1::unordered_set<int> &out,
        const std::tr1::unordered_set<int> &in1,
        const std::tr1::unordered_set<int> &in2)
{
    out.clear();
    out.insert(in1.begin(), in1.end());
    out.insert(in2.begin(), in2.end());
}
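
The same two operations can also be written once for any key type. The following is only a sketch under some assumptions: it uses C++11's std::unordered_set rather than the TR1 version, the names hash_intersection and hash_union are made up for illustration, and it only covers sets that use the default hash and equality. Returning a fresh set instead of filling an output parameter avoids clearing one of the inputs by accident.

```cpp
#include <unordered_set>

// Hypothetical helper (not part of TR1 or the standard library):
// returns the elements present in both a and b.
template <typename T>
std::unordered_set<T> hash_intersection(const std::unordered_set<T>& a,
                                        const std::unordered_set<T>& b) {
    // Iterate over the smaller set and probe the larger one.
    const std::unordered_set<T>& smaller = (a.size() <= b.size()) ? a : b;
    const std::unordered_set<T>& larger  = (a.size() <= b.size()) ? b : a;

    std::unordered_set<T> result;
    for (const T& value : smaller) {
        if (larger.find(value) != larger.end()) {
            result.insert(value);
        }
    }
    return result;
}

// Hypothetical helper: returns the elements present in a, in b, or in both.
template <typename T>
std::unordered_set<T> hash_union(const std::unordered_set<T>& a,
                                 const std::unordered_set<T>& b) {
    std::unordered_set<T> result(a);      // start with a copy of a
    result.insert(b.begin(), b.end());    // duplicates are ignored by insert
    return result;
}
```

Calling hash_intersection(a, b) with a = {1, 2, 3, 4} and b = {3, 4, 5} should give {3, 4}, and hash_union(a, b) should give {1, 2, 3, 4, 5}.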
bdonlan
You can speed up the case of intersecting a big set with a small one by iterating the small one and testing membership in the big one.
Dave
Indeed you can. Updated.
bdonlan
+3  A: 

I see that set_intersection() et al. from the algorithm header won't work as they explicitly require their inputs to be sorted -- guess you ruled them out already.

It occurs to me that the "naive" approach of iterating through hash A and looking up every element in hash B should actually give you near-optimal performance, since successive lookups in hash B will be going to the same hash bucket (assuming that both hashes are using the same hash function). That should give you decent memory locality, even though these buckets are almost certainly implemented as linked lists.

Here's some code for unordered_set_intersection(); you can tweak it to make versions for set union and set difference:

// Needs <algorithm> for std::find.
template <typename InIt1, typename InIt2, typename OutIt>
OutIt unordered_set_intersection(InIt1 b1, InIt1 e1, InIt2 b2, InIt2 e2, OutIt out) {
    while (!(b1 == e1)) {
        // std::find scans [b2, e2) linearly; it does not use the hash table's own find().
        if (!(std::find(b2, e2, *b1) == e2)) {
            *out = *b1;
            ++out;
        }
        ++b1;
    }

    return out;
}

Assuming you have two unordered_sets, x and y, you can put their intersection in z using:

unordered_set_intersection(
    x.begin(), x.end(),
    y.begin(), y.end(),
    std::inserter(z, z.begin())    // needs <iterator>
);

Unlike bdonlan's answer, this will actually work for any key types, and any combination of container types (although using set_intersection() will of course be faster if the source containers are sorted).
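
One caveat: std::find over an iterator range is a linear scan, so the function above does not benefit from hashing when the second range comes from an unordered_set. A possible variant, sketched here with the hypothetical name intersection_by_lookup, takes the second container itself and calls its find() member. It assumes the container has find() and end() (true for unordered_set, set, and map-like containers keyed on the element) and uses C++11's std::unordered_set in the example.

```cpp
#include <iterator>
#include <unordered_set>

// Hypothetical variant: copies to out every element of [first, last)
// that is also found in lookup, using lookup's own find() member.
template <typename InIt, typename Container, typename OutIt>
OutIt intersection_by_lookup(InIt first, InIt last,
                             const Container& lookup, OutIt out) {
    for (; first != last; ++first) {
        // For a hash set this is an average constant-time probe.
        if (lookup.find(*first) != lookup.end()) {
            *out = *first;
            ++out;
        }
    }
    return out;
}
```

With two sets x and y, the call would look like intersection_by_lookup(x.begin(), x.end(), y, std::inserter(z, z.begin())). The trade-off is that the second argument can no longer be an arbitrary iterator range.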

NOTE: If bucket occupancies are high, it's probably faster to copy each hash into a vector, sort them and set_intersection() them there, since searching within a bucket containing n elements is O(n).
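
A rough sketch of that copy-and-sort alternative follows. It assumes C++11's std::unordered_set and a key type that supports operator<; the function name sorted_intersection is made up for illustration. Whether it actually beats the hash-lookup approach would need to be measured.

```cpp
#include <algorithm>
#include <iterator>
#include <unordered_set>
#include <vector>

// Hypothetical helper: intersects two hash sets by copying each into a
// vector, sorting both, and running std::set_intersection on the results.
template <typename T>
std::vector<T> sorted_intersection(const std::unordered_set<T>& a,
                                   const std::unordered_set<T>& b) {
    std::vector<T> va(a.begin(), a.end());
    std::vector<T> vb(b.begin(), b.end());

    // std::set_intersection requires both input ranges to be sorted.
    std::sort(va.begin(), va.end());
    std::sort(vb.begin(), vb.end());

    std::vector<T> result;
    std::set_intersection(va.begin(), va.end(),
                          vb.begin(), vb.end(),
                          std::back_inserter(result));
    return result;  // sorted, because set_intersection preserves order
}
```

The result comes back as a sorted vector; it can be inserted into another unordered_set afterwards if a hash set is needed.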

j_random_hacker
+1 Excellent answer. It would be interesting to benchmark this code. It might actually be faster (if the sets are bigger but not too big) to copy them into a sorted set and run std::set_intersection().
ceretullis
Thanks ceretullis. Yes, I suspect that would be faster if the buckets have high occupancy, though in that case I suspect copying them to vectors and sorting those will be faster still, just because there is less memory overhead and no pointer chasing involved. (Sorting a vector and creating a sorted set are both O(n log n).)
j_random_hacker