views: 570
answers: 5

I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.

My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.

What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!

+6  A: 

I would recommend trying Google SparseHash and seeing if it suits your needs. It has both a memory-efficient implementation and one optimized for speed. When I benchmarked hash table implementations a long time ago, it was the best available in terms of speed (though with drawbacks).

Scharron
Can you elaborate on what the drawbacks were?
Haywood Jablomey
IIRC, it was a memory issue: when an element was removed, it was destructed, but its memory stayed allocated (used as a cache, I guess).
Scharron
@Haywood Jablomey: The main drawback is that it requires you to reserve one or two key values (two if you ever erase elements) and never use them. In some cases this is easy to do, e.g. negative ints or the like, but in other cases not so much.
doublep
+6  A: 

What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!

Check out the LGPL'd Judy arrays. I've never used them myself, but they've been recommended to me on a few occasions.

You can also benchmark the STL containers (std::hash_map, etc.). Depending on the platform/implementation and source-code tuning (preallocate as much as you can; dynamic memory management is expensive), they could be performant enough.

Also, if the performance of the final solution trumps its cost, you can order the system with enough RAM to put everything into plain arrays. Performance of access by index is unbeatable.

The add/delete operations are much (100x) more frequent than the get operation.

That hints that you might want to concentrate on improving the algorithms first. If the data are only written and never read, why write them at all?

Dummy00001
+1 I would've answered Judy if you weren't first. :-)
Peter G.
+3  A: 

Just use boost::unordered_map (or tr1, etc.) by default. Then profile your code and see whether that map is the bottleneck. Only then would I suggest precisely analyzing your requirements to find a faster substitute.

Mark B
A: 

First check whether an existing solution like libmemcache fits your needs.

If not ...

Hash maps seem to be the definitive answer to your requirement. They provide O(1) lookup based on the key. Most STL libraries provide some sort of hash map these days, so use the one provided by your platform.

Once that part is done, test the solution to see whether the default hashing algorithm is fast enough for your needs.

If it is not, you should explore some of the good, fast hashing algorithms found on the net:

  1. good old prime number multiply algo
  2. http://www.azillionmonkeys.com/qed/hash.html
  3. http://burtleburtle.net/bob/
  4. http://code.google.com/p/google-sparsehash/

If that is still not good enough, you could roll your own hashing module that fixes the problems you saw with the STL containers you tested, using one of the hashing algorithms above. Be sure to post the results somewhere.

Oh, and it's interesting that you have multiple maps. Perhaps you can simplify by using a 64-bit key, with the high bits distinguishing which map it belongs to, and adding all the key/value pairs to one giant hash. I have seen hashes with a hundred thousand or so symbols work perfectly well on the basic prime-number hashing algorithm.

You can then check how that solution performs compared to hundreds of maps; I think it could be better from a memory-profiling point of view. Please do post the results somewhere if you get to do this exercise.

I believe that, more than the hashing algorithm, it could be the constant add/delete of memory (can it be avoided?) and the CPU cache usage profile that are most crucial for your application's performance.

good luck

A: 

Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.

doublep