Hello,

I have data that is a set of ordered ints

[0] = 12345 [1] = 12346 [2] = 12454 etc.

I need to check whether a value is in the collection in C++. What container will have the lowest lookup complexity? In this case, the data does not grow after initialization. In C# I would use a dictionary; in C++, I could use either a hash_map or a set. If the data were unordered, I would use Boost's unordered containers. However, do I have better options since the data is ordered? Thanks

EDIT: The size of the collection is a couple of hundred items

+3  A: 

Use a sorted std::vector, and use std::binary_search to search it.

Your other options would be a hash_map (not in the C++ standard yet, but there are implementations available, e.g. SGI's hash_map and boost::unordered_map) or a std::map.

If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.

luke
I do wonder about the relative performance of the sorted vector against an unordered_set, both in terms of memory and raw speed. I don't think the answer is so clear-cut there.
Matthieu M.
@Matthieu, I'm not sure how `unordered_set` is implemented, but vectors and deques benefit from cache locality thanks to their contiguous memory.
luke
@luke: cache locality won't help with a binary search on a large vector, since you'll be doing single reads from many different locations.
Mike Seymour
+2  A: 

I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.

Both std::unordered_set and std::set add significant memory overhead - and even though unordered_set provides O(1) lookup, the O(log n) binary search will probably outperform it, given that the data is stored contiguously (no pointer chasing, less chance of a page fault, etc.) and you don't need to compute a hash function.

Joe Gauterin
Actually the hash function for an `int` is relatively trivial. As for the page fault / cache miss problem, I would not be so quick to treat them as such obstacles, especially since we don't know the magnitude of `n` here.
Matthieu M.
+4  A: 

If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).

Mike Seymour
Binary search requires an ordered container: the `std::vector`, `std::deque`, or *plain array* **must** be sorted for `std::binary_search` and `std::lower_bound` to produce correct results.
Thomas Matthews
+4  A: 

Just to expand a bit on what has already been said.

Sorted Containers

The immutability is extremely important here: std::map and std::set are usually implemented as binary trees (red-black trees in the STL implementations I have looked at) because of the requirements on insertion, retrieval and deletion operations (and notably because of the iterator-invalidation requirements).

However, because of the immutability, as you suspected, there are other candidates, not the least of which are array-like containers. They have a few advantages here:

  • minimal overhead (in terms of memory)
  • contiguity of memory, and thus cache locality

Several "Random Access Containers" are available here:

  • Boost.Array
  • std::vector
  • std::deque

So the only thing you actually need to do can be broken down into 2 steps:

  • push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
  • search for the value using std::binary_search, which has O(log(n)) complexity

Because of cache locality, the search will in practice be faster than on a node-based std::set, even though the asymptotic complexity is the same.

If you don't want to reinvent the wheel, you can also check Alexandrescu's AssocVector (from his Loki library). Alexandrescu basically ported the std::set and std::map interfaces onto a std::vector:

  • because it's faster for small datasets
  • because it can be faster for frozen datasets

Unsorted Containers

Actually, if you really don't care about order and your collection is kind of big, then an unordered_set will be faster, especially because integers are so trivial to hash: `size_t hash_method(int i) { return i; }`.

This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unordered containers degrade to scanning the collision list of a given bucket in linear time.

Conclusion

Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.

Unfortunately we can't really help more than that, because the outcome heavily depends on the size of the dataset and the distribution of its elements.

Matthieu M.
+1  A: 

If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.

Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)

There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.

So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.

So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.

David Thornley