views:

137

answers:

4

Dear reader,

I've got two vector<MyType*> objects called A and B. The MyType class has a field ID and I want to get the MyType* which are in A but not in B. I'm working on a image analysis application and I was hoping to find a fast/optimized solution.

Kind regards, Pollux

+2  A: 

Sort both vectors (std::sort) according to ID and then use std::set_difference. You will need to define a custom comparator to pass to both of these algorithms, for example

struct comp
{
    bool operator()(MyType * lhs, MyType * rhs) const
    {
        return lhs->id < rhs->id;
    }
};
avakar
Hi Avakar, thanks for your reply! Can you maybe show an example of how to use the set_difference in such a way that I get a new vector with the difference?
pollux
Could the OP just define `operator<` for `MyType`?
Bill
@Bill not if he stores MyType*.
@pollux `std::set_difference` takes an OutputIterator as the last argument. Use `std::back_inserter` to create one from a `vector<MyType*>`.
pmr
@sti: good point, nevermind!
Bill
+2  A: 

The unordered approach will typically have quadratic complexity unless the data is sorted beforehand (by your ID field), in which case it would be linear and would not require repeated searches through B.

struct CompareId
{
    bool operator()(const MyType* a, const MyType* b) const
    {
        return a>ID < b->ID;
    }
};
...
sort(A.begin(), A.end(), CompareId() );
sort(B.begin(), B.end(), CompareId() );

vector<MyType*> C;
set_difference(A.begin(), A.end(), B.begin(), B.end(), back_inserter(C) );

Another solution is to use an ordered container like std::set with CompareId used for the StrictWeakOrdering template argument. I think this would be better if you need to apply a lot of set operations. That has its own overhead (being a tree) but if you really find that to be an efficiency problem, you could implement a fast memory allocator to insert and remove elements super fast (note: only do this if you profile and determine this to be a bottleneck).

Warning: getting into somewhat complicated territory.

There is another solution you can consider which could be very fast if applicable and you never have to worry about sorting data. Basically, make any group of MyType objects which share the same ID store a shared counter (ex: pointer to unsigned int).

This will require creating a map of IDs to counters and require fetching the counter from the map each time a MyType object is created based on its ID. Since you have MyType objects with duplicate IDs, you shouldn't have to insert to the map as often as you create MyType objects (most can probably just fetch an existing counter).

In addition to this, have a global 'traversal' counter which gets incremented whenever it's fetched.

static unsigned int counter = 0;
unsigned int traversal_counter()
{
    // make this atomic for multithreaded applications and
    // needs to be modified to set all existing ID-associated
    // counters to 0 on overflow (see below)
    return ++counter;
}

Now let's go back to where you have A and B vectors storing MyType*. To fetch the elements in A that are not in B, we first call traversal_counter(). Assuming it's the first time we call it, that will give us a traversal value of 1.

Now iterate through every MyType* object in B and set the shared counter for each object from 0 to the traversal value, 1.

Now iterate through every MyType* object in A. The ones that have a counter value which doesn't match the current traversal value(1) are the elements in A that are not contained in B.

What happens when you overflow the traversal counter? In this case, we iterate through all the counters stored in the ID map and set them back to zero along with the traversal counter itself. This will only need to occur once in about 4 billion traversals if it's a 32-bit unsigned int.

This is about the fastest solution you can apply to your given problem. It can do any set operation in linear complexity on unsorted data (and always, not just in best-case scenarios like a hash table), but it does introduce some complexity so only consider it if you really need it.

A: 

First look at the problem. You want "everything in A not in B". That means you're going to have to visit "everything in A". You'll also have to visit everything in B to have knowledge of what is and is not in B. So that suggests there should be an O(n) + O(m) solution, or taking liberty to elide the difference between n and m, O(2n).

Let's consider the std::set_difference approach. Each sort is O(n log n), and set_difference is O(n). So the sort-sort-set_difference approach is O(n + 2n log n). Let's call that O(4n).

Another approach would be to first place the elements of B in a set (or map). Iteration across B to create the set is O(n) plus insertion O(log n) of each element, followed by iteration across A O(n), with a lookup for each element of A (log n), gives a total: O(2n log n). Let's call that O(3n), which is slightly better.

Finally, using an unordered_set (or unordered_map), and assuming we get average case of O(1) insertion and O(1) lookup, we have an approach that is O(2n). A-ha!

The real win here is that unordered_set (or map) is probably the most natural choice to represent your data in the first place, i.e., the proper design yields the optimized implementation. That doesn't always happen, but it's nice when it does!

John
@John, "[...] approach is O(n + 2n log n). Let's call that O(4n)." You need to read this: http://en.wikipedia.org/wiki/Big_O_notation
avakar
Nice analysis of the algorithmic complexities and very nice that you pointed out unordered_set, but the unordered vector case would be O(N) * O(M), not O(N) + O(M), or O(N^2). We have to repeatedly search B for each element in A. Also O(N*logN) is considerably different than O(2N). If N is 1000, 2N would be 2000 while N*logN would be ~10,000. I don't think we can simplify it that much.
A: 

If B preexists to A, then while populating A, you can bookkeep in a C vector.

tojas