I need some kind of priority queue to store <key, value> pairs. Values are unique, but keys aren't. I will be performing the following operations (most common first):

  1. random insertion;
  2. retrieval (and removal) of all elements with the least key;
  3. random removal (by value).

I can't use std::priority_queue because it only supports removing the head.

For now, I'm using an unsorted std::list. Insertion just pushes new elements to the back (O(1)). Operation 2 sorts the list with list::sort (O(N log N)) before performing the actual retrieval. Removal, however, is O(N), which is a bit expensive.

Any idea of a better data structure?

+1  A: 

If you're using Visual Studio, it ships with hash_multimap. I should also add that Boost has an unordered multimap (boost::unordered_multimap). If you need ordered containers, there are the STL's std::multimap and std::multiset.

wheaties
I think the standard container should be the default option in most situations, if nothing else because it is the standard container.
Billy ONeal
pfft, standard containers. If people used those, programs would crash less. Then people like me wouldn't have to explain to a boss why an unrelated section of code is impeding production of new functionality and we just got "lucky" it never happened before. Then we'd never have to hire high-priced consultants to write "better" containers that didn't crash until they did. You see the cycle of life you're attempting to ruin here!?
wheaties
See my solution.
Helltone
@Helltone multimap holds things in order but can take multiple values per key. Inserts are O(log N), while retrievals can be up to O(N). I think you should go with multimap.
wheaties
+4  A: 

Can you reverse the order of the collection, i.e. store the elements in <value, key> order?

Then you could just use std::map, with O(log n) insertion, O(n) retrieval (traversing the whole collection), and O(log n) random removal of a value (which would be the key of said map).

If you could find a map implementation based on hashes instead of trees (std::map is tree-based), the times would be even better: O(1), O(n), O(1).

pajton
+1. Or the nonstandard container `hash_map`/soon to be standard `std::unordered_map`.
Billy ONeal
It needs to order and retrieve by key, so I'm not sure this would work.
Mark B
@BillyONeal yes, `hash_map` would yield even better times: `O(1)`, `O(n)`, `O(1)`. @MarkB he didn't state that anywhere, so why do you think so?
pajton
@pajton: Actually, I like your answer better -- the standard containers should be used over nonstandard containers unless you have *very* good reason to do otherwise because the standard containers are just that -- standard. And the hashed containers do not offer O(1) worst case performance -- but they do on average. For some applications amortized analysis is not allowable practice.
Billy ONeal
A: 

std::multimap seems to be what you are searching for.

It will store your objects ordered by key, and lets you retrieve the lowest/highest key value (begin(), rbegin()) and all the objects with a given key (equal_range, lower_bound, upper_bound).

(EDIT: if you have just a few items, say fewer than 30, you should also test the performance of just using a deque or a vector)

baol
A: 

If I understood well, your performance targets are fast (1) and (3), and (2) is not that important. In this case, and given that values are unique, why not just use a std::set<value> and do a sequential search for (2)? You'd have O(log n) for (1) and (3), and O(n) for (2). Better yet, if your STL has std::hash_set, you'd get close to O(1) for (1) and (3).

If you need something better than O(n) for (2), one alternative would be to have a set of priority queues.

Fabio Ceconello
I said that the operations were **most common first**. My initial solution is better than yours because it's O(1) for insertion.
Helltone
I don't think you'll be able to speed up the other operations without giving up a bit in (1); there's no magic here. Anyway, if you have a std::hash_set implementation, insertion is quite close to O(1).
Fabio Ceconello
Your solution, `Helltone`, has not been timed, so you don't know whether it's better. Notably, you forgot to consider the memory aspects of speed, like caching and heap usage. Unless you time it, you can't say for sure which is faster. An `O(1)` with a big constant is worse than `O(log(n))` for small values of `n`.
Matthieu M.
+3  A: 

When you need order, use an ordered container. There is no point in paying the cost of sorting later on.

Your current solution is:

  • Insertion O(1)
  • Retrieval O(N log N)
  • Removal O(N) (which is as good as you can get without keeping another index there)

Simply using a std::multimap you could have:

  • Insertion O(log N)
  • Retrieval O(log N) <-- much better, isn't it? We need to find the end of the range
  • Removal O(N)

Now, you could do slightly better with a std::map< key, std::vector<value> >:

  • Insertion O(log M) where M is the number of distinct keys
  • Retrieval O(1) (begin is guaranteed to be amortized constant time)
  • Removal O(N)

You can't really push the random removal... unless you're willing to keep another index there. For example:

typedef std::vector<value_type> data_value_t;
typedef std::map<key_type, data_value_t> data_t;

typedef std::pair<data_t::iterator,size_t> index_value_t;
  // where iterator gives you the right vector and size_t is an index in it

typedef std::unordered_map<value_type, index_value_t> index_t;

But keeping this second index up to date is error prone... and will be done at the expense of the other operations! For example, with this structure, you would have:

  • Insertion O(log M) --> insertion into the hash map itself is O(1)
  • Retrieval O(N/M) --> we need to de-index all the values of the retrieved vector from the secondary index; there are N/M of them on average
  • Removal O(N/M) --> finding the entry in the hash map is O(1), dereferencing is O(1), and removing from the vector is O(N/M) because we need to shift roughly half of the vector's contents. Using a list would yield O(1)... but might not actually be faster (it depends on the number of elements, because of the memory trade-off).

Also bear in mind that hash map complexities are amortized. Trigger a reallocation because you exceeded the load factor, and that particular insertion will take a very long time.

I'd really go with the std::map<key_type, std::vector<value_type> > in your stead. That's the best bang for the buck.

Matthieu M.
boost has multi-index container. So if you're going to go that route, use someone else's already debugged code.
Eric H.
Unfortunately I don't know of a way to obtain the final structure with a `MultiIndex` container. I mean, you could ask for `multiset` on key and `hash_map` on values with unique constraint, but that's not the final design I devised. Probably good enough though.
Matthieu M.
A: 

Ok, so I've tested many options and ended up with something based on Matthieu M.'s idea. I'm currently using a std::map<key_type, std::list<value_type> >, where each value_type holds a std::list<value_type>::iterator to itself, which is useful for removal.

Removal must check whether the list is empty, which implies a map query and possibly a call to erase. The worst case is when keys are distinct: O(log N) for insertion, O(1) for retrieval and O(log N) for removal. I've got very good experimental results compared to the other alternatives on my test machine.

Using a std::vector is less efficient, both in theoretical complexity (O(N) worst case for removal when keys are identical) and in the experiments I've been doing.

Helltone