views:

1449

answers:

10

I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.

I'm finding the performance be very slow.

Details: - One of the sets has 150,000 numbers in it - The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time - I haven't done any profiling yet, but every time I break the debugger while doing the intersection its in malloc.c!

So, how can I improve this performance? Switch to a different data structure? Some how improve the memory allocation performance of map?

Update:

  1. Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
  2. Or, are there any tips for using these efficiently?

Update2:

See http://stackoverflow.com/questions/1056366/fast-c-container-like-the-c-hashsett-and-dictionaryk-v

Update3:

I benchmarked set_intersection and got horrible results:

(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms

Code:

int runIntersectionTestAlgo()
{   

    set<int> set1;
    set<int> set2;
    set<int> intersection;


    // Create 100,000 values for set1
    for ( int i = 0; i < 100000; i++ )
    {
     int value = 1000000000 + i;
     set1.insert(value);
    }

    // Create 1,000 values for set2
    for ( int i = 0; i < 1000; i++ )
    {
     int random = rand() % 200000 + 1;
     random *= 10;

     int value = 1000000000 + random;
     set2.insert(value);
    }

    set_intersection(set1.begin(),set1.end(), set2.begin(), set2.end(), inserter(intersection, intersection.end()));

    return intersection.size(); 
}
A: 

What is your intersection algorithm? Maybe there are some improvements to be made?

Here is an alternate method

I do not know it to be faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).

This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.

Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.

Here is another alternate method

Change the sets of numbers to std::vector<int> and use std::sort to sort from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set since the array is more tightly packed in memory. Actually, never mind that, you can then just iterate through the values in lock-step, looking at the ones with the same value. Increment only the iterators which are less than the minimum value you saw at the previous step (if the values were different).

1800 INFORMATION
I have N iterators, one for each set. Each time through a loop, I advance each iterator, increment the count of times I've seen each value, once a value's count reaches N then I've found an value that exists in all sets.I'm using a std:map<int,int> to track counts of each value.
Alex Black
For your second thing, that is exactly what std::set_intersection already does. And for your first thing, it takes like O(m*log(n)) which is inefficient if you already have std::sets (sorted containers), so you could just use std::set_intersection on them, which O(m+n).
newacct
set_intersection performed very slowly in my benchmark.
Alex Black
using set_intersection with actual std::set's is likely to be bad. I would try using sorted vectors instead
1800 INFORMATION
The latest code uses sorted vectors.
Alex Black
Are you still inserting into a set? Or a vector? I think you want to be inserting into a vector - you can set the size of the vector before you run, so that the memory for the vector does not have to be reallocated. Are you still spending most of your time in malloc? If so, then preallocating the vector you are inserting into will be faster
1800 INFORMATION
+1  A: 

Without knowing any more about your problem, "check with a good profiler" is the best general advise I can give. Beyond that...

If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.

If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.

If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).

Tom
A: 

Might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set), and throwing them into yet another map. This is doing a lot of work you don't need to do, since the keys of a standard set are in sorted order already. Instead, take a "merge-sort" like approach. Spin over each iter, dereferencing to find the min. Count the number that have that min, and increment those. If the count was N, add it to the intersection. Repeat until the first map hits it's end (If you compare the sizes before starting, you won't have to check every map's end each time).

Responding to update: There do exist faculties to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:

std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;

But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.

Todd Gardner
Thats a good idea... but right now my sets aren't sorted. I might try sorting them first.
Alex Black
Some more information on what these "sets" you are intersectioning might be helpful - are they like vectors?
Todd Gardner
They're stored in an in-memory database, they're lists of numbers, I was hoping not to have to get the entire set in one data structure and sort it before starting to intersect them.
Alex Black
I'll try out boost::pool_alloc just to see if it makes a difference.
Alex Black
I tried out boost::pool_alloc and boost::fast_pool_alloc on my boost:unordered_map and performance got MUCH slower.
Alex Black
A: 

Intersection with maps are slow, try a hash_map. (however, this is not provided in all STL implementation.

Alternatively, sort both map and do it in a merge-sort-like way.

J-16 SDiZ
I tried hash_map and boost::unordered_map, similar poor performance.
Alex Black
+1  A: 

I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges (like set_intersection, set_union, etc):

set_intersection

M S
I might try that, but that requires I obtain each set in full first, and sort them.I have implemented this same algorithm in C# using HashSet<int> and it is blazingly fast.
Alex Black
The C# algorithm runs against the same data too. Is C# memory allocation faster? Or is HashSet faster than map/hash_map/unordered_map?
Alex Black
+1  A: 

I don't understand why you have to use a map to do intersection. Like people have said, you could put the sets in std::set's, and then use std::set_intersection().

Or you can put them into hash_set's. But then you would have to implement intersection manually: technically you only need to put one of the sets into a hash_set, and then loop through the other one, and test if each element is contained in the hash_set.

newacct
I tried set_intersection, got horrible results.
Alex Black
A: 

Look at your algorithms, then choose the proper data type. If you're going to have set-like behaviour, and want to do intersections and the like, std::set is the container to use.

Since it's elements are stored in a sorted way, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.

xtofl
A: 

I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.

Alex Black
+1  A: 

You should definitely be using preallocated vectors which are way faster. The problem with doing set intersection with stl sets is that each time you move to the next element you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector the next element will often be in your cache because it's physically close to the previous element.

The trick with vectors, is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE because it'll go on reallocating memory as it resizes itself during your initialization step.

Try something like this instaed - it'll be WAY faster.

int runIntersectionTestAlgo() { 

vector<char> vector1; vector1.reserve(100000);
vector<char> vector2; vector2.reserve(1000);

// Create 100,000 values for set1
for ( int i = 0; i < 100000; i++ )    {
    int value = 1000000000 + i;
    set1.push_back(value);
}

sort(vector1.begin(), vector1.end());

// Create 1,000 values for set2
for ( int i = 0; i < 1000; i++ )    {
    int random = rand() % 200000 + 1;
    random *= 10;
    int value = 1000000000 + random;
    set2.push_back(value);
}

sort(vector2.begin(), vector2.end());

// Reserve at most 1,000 spots for the intersection
vector<char> intersection; intersection.reserve(min(vector1.size(),vector2.size()));
set_intersection(vector1.begin(), vector1.end(),vector2.begin(), vector2.end(),back_inserter(intersection));

return intersection.size(); 
}
Chris Harris
Hi Chris, how is that different than my latest code? The latest code does reserve space in the vectors, does pass vectors to set_intersection etc.
Alex Black
@Chris: sorry, see this thread: http://stackoverflow.com/questions/1060648/fast-intersection-of-sets-c-vs-c it uses the approach you described and is definitely faster.
Alex Black
A: 

HI Alex, so what was your final version ? Still using std::set ? Or a sorted vector ? [Out of interest] Thank you.

Jonathan