views: 149, answers: 7
I need to find a number of objects from a large container.

The only way I can think of to do that is to search the container for one item at a time in a loop. However, even with an efficient search that averages, say, "log n" (where n is the size of the container), that gives me "m log n" (where m is the number of items I'm looking for) for the entire operation.

That seems highly suboptimal to me, and as it's something I'm likely to need to do on a frequent basis, it's something I'd definitely like to improve if possible.

Neither part has been implemented yet, so I'm open to suggestions on the format of the main container, the "list" of items I'm looking for, etc., as well as the actual search algorithm.

The items are complex objects, but the search key is just a simple integer.

+5  A: 

If you're purely doing look-up (you don't require ordered elements) and can give up some memory, try unordered_map (it's TR1, also implemented in Boost), which has constant-time amortized look-up.

In a game engine, we tested std::map and unordered_map, and while map was faster for insertions (if I recall), unordered_map blew it out of the water for retrieval. We had greater than 1000 elements in the map, for scale, which is fairly low compared to some other tasks you may be doing.

If you require elements to be ordered, your next bet is std::map, which has the look-up times you've posted, and keeps the elements ordered. In general, it also uses less memory than an unordered_map.
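
For illustration, here's a minimal sketch of the look-up loop with the Boost version (the Item struct and the hard-coded keys are made up for this example; the TR1 flavour works the same way):

#include <boost/unordered_map.hpp>
#include <vector>
#include <cstdio>

struct Item { int key; double payload; };   // stand-in for the "complex object"

int main()
{
    boost::unordered_map<int, Item> index;  // key -> object, O(1) average look-up
    Item a = { 42, 1.5 };
    index[a.key] = a;                       // insert while building the container

    std::vector<int> keys;                  // the m integer keys being searched for
    keys.push_back(42);
    keys.push_back(7);

    for (std::size_t i = 0; i < keys.size(); ++i)
    {
        boost::unordered_map<int, Item>::const_iterator it = index.find(keys[i]);
        if (it != index.end())
            std::printf("found key %d\n", it->first);
    }
    return 0;
}

Each of the m look-ups is constant time on average, so the whole operation is O(m) rather than O(m log n).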

GMan
A: 

boost/tr1 unordered_map and unordered_set are containers backed by a hash table, which gives you search in amortized constant time [ O(1) ].

Boost Unordered documentation.

drspod
A: 

I suppose that if you have a sorted container and a uniform distribution of items, the most efficient method would be a recursive bisection search with an execution path somewhat like a tree: it calls itself on both halves whenever the objects being searched for fall on both sides of the bisection point (see the sketch below).

However, if you choose a container based on a hash-table (boost unordered set, I think?), or something similar, then lookup can be O(1), so searching in a loop really doesn't matter.

EDIT: note that std::map and std::set are normally (always?) implemented using rb-trees, so are only log(n) for lookup.
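
Roughly what that recursion could look like (the Item type and the function are my own sketch, not from the question): it bisects the sorted key list, finds where the middle key lands in the container with lower_bound, and only recurses into container ranges that can still hold the remaining keys.

#include <algorithm>
#include <vector>

struct Item { int key; /* other fields */ };

// comparator for lower_bound: "is this element's key less than k?"
inline bool keyLess(const Item& a, int k) { return a.key < k; }

// items: sorted by key, searched in [lo, hi)
// wanted: sorted keys to find, searched in [wlo, whi)
void findAll(const std::vector<Item>& items, int lo, int hi,
             const std::vector<int>& wanted, int wlo, int whi,
             std::vector<int>& foundIndices)
{
    if (lo >= hi || wlo >= whi)
        return;
    int wmid = (wlo + whi) / 2;   // middle key of the wanted range
    int pos = static_cast<int>(std::lower_bound(items.begin() + lo, items.begin() + hi,
                                                wanted[wmid], keyLess) - items.begin());
    if (pos < hi && items[pos].key == wanted[wmid])
        foundIndices.push_back(pos);                               // exact match
    findAll(items, lo, pos, wanted, wlo, wmid, foundIndices);      // keys smaller than wanted[wmid]
    findAll(items, pos, hi, wanted, wmid + 1, whi, foundIndices);  // keys larger than wanted[wmid]
}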

Autopulated
The way `map` and co. are implemented is an implementation detail. The standard only says how the structure should perform asymptotically. Commonly, yes, it's a red-black tree, but implementors are free to use whatever they see fit.
GMan
I thought as much, but I've never seen one that wasn't a red-black tree :)
Autopulated
That's because red-black trees are sexy. :P
GMan
+4  A: 

Hash tables have basically O(1) lookup. This gives you O(m) to look up m items; obviously you can't look up m items faster than O(m) because you need to get the results out.

antti.huima
+3  A: 

If your container is a vector and the elements are sorted, you can use std::lower_bound to search in O(log n) time. If your search items are also sorted, you can do a small optimization by always using the last found iterator as the start of the search for the next one, e.g.

vector<stuff> container;           // sorted by the integer key
vector<stuff> search_items;        // the items to find, also sorted
vector<stuff>::iterator it = container.begin();
for (size_t i = 0;  i < search_items.size() && it != container.end();  ++i)
{
    // resume from the previous hit instead of container.begin()
    it = std::lower_bound(it, container.end(), search_items[i]);
    // lower_bound returns the first element not less than the key,
    // so make sure the found item is an exact match
    if (it != container.end() && search_items[i] < *it)
        it = container.end(); // this item is missing - break out early
}
if (it != container.end())  // found it!
Mark Ransom
I completely forgot about this optimization. It is really good if you pop your target keys into a set and iterate over them and the map at the same time.
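Something like this merge-style walk over both sorted sequences, I think (the Item type and the names are placeholders):

#include <map>
#include <set>

struct Item { int key; /* other fields */ };

void findAll(const std::map<int, Item>& container, const std::set<int>& targets)
{
    std::map<int, Item>::const_iterator c = container.begin();
    std::set<int>::const_iterator t = targets.begin();
    while (c != container.end() && t != targets.end())
    {
        if (c->first < *t)
            ++c;                 // container element not wanted, skip it
        else if (*t < c->first)
            ++t;                 // this target key is not in the container
        else
        {
            /* c->second is one of the wanted objects */
            ++c;
            ++t;
        }
    }
}

That walk is O(n + m) instead of m separate look-ups, which pays off when the target keys cover a large part of the container.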
D.Shawley
+1  A: 

Are you sure that m log2(n) is actually going to be a problem? If you are using a std::map, even a relatively large one, the actual number of comparisons is still pretty small - if you are looking up 10,000 elements in a map of 1,000,000, the number of comparisons should be about 200,000, or about 20 comparisons per target element. That really isn't bad if your key is just a simple integer.

If you were hashing something that didn't already have a nice key, then I would say go with boost::unordered_map. I would implement it with std::map first, profile it, and then decide if you want to make the next jump to Boost.
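
One way to keep that later jump cheap is to hide the choice behind a typedef; a sketch with made-up names, relying only on the find/end interface that both containers share:

#include <map>
// #include <boost/unordered_map.hpp>   // pulled in once you make the jump

struct Item { int key; /* other fields */ };

// change this one line (and the include) after profiling:
typedef std::map<int, Item> ItemIndex;
// typedef boost::unordered_map<int, Item> ItemIndex;

bool contains(const ItemIndex& index, int key)
{
    return index.find(key) != index.end();   // identical call for both containers
}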

D.Shawley
Make sure to look at Mark's answer about using your last position as a start for searching; this is a really nice optimization.
D.Shawley
A: 

If you're frequently performing the same projections on your collection, such as extracting elements with a key of "42", you could consider maintaining these subsets in buckets. You'd internally maintain a hashmap from keys to vectors of elements with that key, and add elements to the appropriate bucket as well as your primary collection representing "everything". Extracting a subgroup of elements is constant time (because the respective collections have already been built), and the memory overhead of maintaining the buckets scales primarily with the number of unique keys in your dataset.

This technique is decidedly less effective if you have a large number of unique key values, and it makes insertions and removals more expensive, but it's good for some situations, so I thought it was at least worth mentioning.
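
A rough sketch of how such buckets might look (all names here are invented; boost::unordered_map is used for the constant-time extraction described above):

#include <boost/unordered_map.hpp>
#include <cstddef>
#include <vector>

struct Item { int key; /* other fields */ };

class BucketedCollection
{
    std::vector<Item> everything;                                 // the primary collection
    boost::unordered_map<int, std::vector<std::size_t> > buckets; // key -> indices into 'everything'
public:
    void add(const Item& item)
    {
        everything.push_back(item);
        buckets[item.key].push_back(everything.size() - 1);       // keep the bucket in sync
    }
    // amortized constant-time extraction of every element with the given key
    const std::vector<std::size_t>& withKey(int key) const
    {
        static const std::vector<std::size_t> empty;
        boost::unordered_map<int, std::vector<std::size_t> >::const_iterator it = buckets.find(key);
        return it == buckets.end() ? empty : it->second;
    }
};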

JohnE