views: 1137
answers: 8

I would like to store a mapping of an integer key to a float value in-memory.

I have roughly 130 million keys (and, accordingly, 130 million values).

My focus is on lookup performance -- I have to do many, many millions of lookups.

The C++ STL provides a map class for associative arrays of this sort. I have several questions about map.

What is the storage overhead of map for a dataset of the size mentioned above? How does storage overhead scale, in general, with map?

It looks like the underlying data structure for map is a red-black tree (a self-balancing binary search tree), and that real-world performance is O(log n) for insertion and retrieval.

It mentions O(1) for a hinted insertion. My input is pre-sorted, so I believe I should be able to provide a hint for insertion events. How would I provide this hint, using the methods listed here?

Is there an STL container that provides better lookup performance?

Are there other publicly-available, open-source frameworks with an associative array class whose underlying data structure would perform better than STL map?

If writing my own container class would provide better lookup performance, what data structures might I research?

I am using GCC 4 for this task, running under either Linux or Mac OS X.

I apologize in advance if these are dumb questions. Thank you for your advice.

+2  A: 

Most compilers ship with a non-standard (but working) hash_map (or unordered_map) that might be faster for you. It's coming in C++0x (it's in TR1), and it is also (as always) already in Boost.

GCC shipped one too; I haven't done C++ on that platform in about 12 years, but it should still be in there somewhere.
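A minimal sketch of the TR1 usage under GCC 4 (the header path is the TR1 one, assuming your GCC 4.x minor version ships it; Boost's boost::unordered_map offers the same interface):

#include <tr1/unordered_map>  // GCC 4 ships this with its TR1 headers
#include <cstdio>

int main() {
    std::tr1::unordered_map<int, float> table;
    table[42] = 3.14f;  // insert

    std::tr1::unordered_map<int, float>::const_iterator it = table.find(42);
    if (it != table.end())
        std::printf("%d -> %f\n", it->first, it->second);
    return 0;
}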

Marcus Lindblom
`unordered_map` is the name that hash maps got in Technical Report 1 (and they were later merged into the C++ draft), so they are 'quite' standard at the very least.
David Rodríguez - dribeas
There is something in boost: http://www.boost.org/doc/libs/1_40_0/doc/html/boost/unordered_map.html
Eld
Thanks for the input, guys! I've edited my answer to include that.
Marcus Lindblom
+3  A: 

If your input is sorted, you should try just a vector and a binary search (i.e., lower_bound()). This might prove adequate (it is also O(log n)). Depending on the distribution of your keys and the hash function used, a hash_map might work as well. I think this is tr1::unordered_map in gcc.
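A minimal sketch of that approach, assuming the data is loaded once in sorted order (the names here are illustrative):

#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int, float> Entry;

// Comparator so lower_bound can compare an Entry against a bare key.
struct KeyLess {
    bool operator()(const Entry& e, int key) const { return e.first < key; }
};

// Returns true and fills 'out' if 'key' is present in the sorted vector.
bool lookup(const std::vector<Entry>& data, int key, float& out) {
    std::vector<Entry>::const_iterator it =
        std::lower_bound(data.begin(), data.end(), key, KeyLess());
    if (it != data.end() && it->first == key) {
        out = it->second;
        return true;
    }
    return false;
}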

KeithB
+14  A: 

Given what you've said, I'd think very hard about using an std::vector<pair<int, float> >, and using std::lower_bound, std::upper_bound, and/or std::equal_range to look up values.

While the exact overhead of std::map can (and does) vary, there's little room for doubt that it will normally consume extra memory and look up values more slowly than a binary search in a vector. As you've noted, it's normally (and almost unavoidably) implemented as some sort of balanced tree, which imposes overhead for the pointers and the balancing information, and typically means each node is allocated separately as well. Since your payload is pretty small (typically 8 bytes: an int plus a float), that extra data is likely to be at least as much as what you're actually storing (i.e. at least 100% overhead). The separate allocations also tend to mean poor locality of reference, which leads to poor cache usage.

Edit: Looking just at implementations of std::map, it's probably worth noting that most use a red-black tree. If you were going to use an std::map, an implementation that uses an AVL tree would probably suit your purposes better -- an AVL tree has slightly tighter constraints on balancing. This gives slightly faster lookup at the expense of slightly slower insertion and deletion (since it has to re-balance more often to maintain its stricter interpretation of "balanced"). As long as your data remains constant during use, however, an std::vector is still almost certainly better.

One other possibility worth noting: if your keys are at least fairly evenly distributed, you might want to try looking up using interpolation instead of bisection. That is, instead of always starting at the middle of the vector, you do a linear interpolation to guess at the most likely starting point for the lookup. Of course, if your keys follow some known non-linear distribution, you can use a matching interpolation instead.
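A rough sketch of the linear-interpolation variant, assuming roughly uniform keys (names illustrative; production code would bound the probe count and fall back to bisection):

#include <utility>
#include <vector>

typedef std::pair<int, float> Entry;

// Interpolation search over a vector sorted by key. Instead of probing
// the middle, probe where a linear interpolation between the endpoint
// keys predicts the target should be. Expected O(log log n) probes for
// uniform keys; worst case degrades to O(n).
const Entry* interp_find(const std::vector<Entry>& v, int key) {
    if (v.empty() || key < v.front().first || key > v.back().first)
        return 0;
    size_t lo = 0, hi = v.size() - 1;
    while (lo <= hi && key >= v[lo].first && key <= v[hi].first) {
        double span = double(v[hi].first) - double(v[lo].first);
        size_t mid = (span == 0.0)
            ? lo
            : lo + (size_t)((double(key) - v[lo].first) / span * (hi - lo));
        if (v[mid].first == key) return &v[mid];
        if (v[mid].first < key) {
            lo = mid + 1;
        } else {
            if (mid == 0) break;  // avoid unsigned wrap-around
            hi = mid - 1;
        }
    }
    return 0;
}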

Edit 2: Assuming the keys are reasonably evenly distributed, the interpolation search has a complexity of O(log log N). For 130 million keys, that works out to around 4 probes to find an item. To do significantly better than that with (normal/non-perfect) hashing, you need a good algorithm, and you need to keep the load factor in the table around 75% or so -- i.e. you need to allow for something like 32 million extra (empty) spots in your table to improve the expected complexity from four probes to three. I may just be old fashioned, but that strikes me as a lot of extra storage to use for such a small speed improvement.

OTOH, it's true that this is nearly the ideal situation for perfect hashing -- the set is known ahead of time, and the key is quite small (important, since hashing is normally linear on the key size). Even so, unless the keys are distributed pretty unevenly, I wouldn't expect any huge improvement -- a perfect hash function is often (usually?) fairly complex.

Jerry Coffin
Definitely just use a binary search in the vector. Least memory, fastest too.
Will
What about insertions? To use binary search, you'll have to keep the array sorted. Random insertions in a vector are not particularly efficient.
RaphaelSP
@RaphaelSP: yes, if the data were dynamic (i.e. you need to support insertions/deletions during use), a sorted vector isn't a good choice. He notes, however, that the data starts out sorted, which I took as indicating that he's just reading in data, but not modifying it afterwards.
Jerry Coffin
I am only inserting once. I do not need to modify my input set afterwards.
Alex Reynolds
Oops, it seems I read the question too quickly... I'll leave my comment in case someone does the same.
RaphaelSP
If you deal with very large data, you could consider using a std::deque instead of a vector. The vector consumes one large contiguous chunk of memory, whereas the deque uses lots of smaller chunks of the same size. This can be friendlier to the operating system's memory management. It is easy to try this out, because for this purpose the interface of std::deque is the same.
SebastianK
This solution is O(log n); if that's good enough, it's also the simplest. Otherwise consider a hash table, which approaches O(1) (especially with a perfect hash).
Mark Ransom
+1  A: 

As a partial answer to your question concerning lookup performance, you have to consider your insertion pattern. You noted that std::map uses a red-black tree as a hedge against a carefully sorted insertion sequence degenerating the structure into a linked list. Hence, such a tree provides O(log n) lookup time despite an aberrant insertion order. You pay a cost for this, however, in insertion, removal, and traversal performance, as well as in lost locality of reference when repeatedly reading "nearby" data.

A hash table may offer faster lookup if you can tune a hash function for your key type (an integer, you say) to preclude collisions. If your data set is fixed, such that you can load it once and only read it afterward, you could use parallel arrays of integers and floats, and use std::lower_bound to find your match via binary search. Sorting the parallel arrays properly would be a chore, since the keys are divorced from their corresponding values, but you'd enjoy tighter storage and better locality of reference than with an array of, say, std::pair.
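A rough sketch of that parallel-array layout (all names illustrative; it assumes the two vectors were sorted together by key at load time):

#include <algorithm>
#include <vector>

// keys[i] corresponds to values[i]; both sorted together by key.
// The binary search touches only the densely packed key array, so
// twice as many keys fit per cache line as with an array of pairs.
struct ParallelMap {
    std::vector<int>   keys;
    std::vector<float> values;

    // Returns a pointer to the value for 'key', or 0 if absent.
    const float* find(int key) const {
        std::vector<int>::const_iterator it =
            std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key)
            return 0;
        return &values[it - keys.begin()];
    }
};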

seh
A: 

You may have a look at std::tr1::unordered_map.

But if you have 32-bit unsigned integer keys (4294967296 possible values) and 130 million distinct keys, then you should write your own container optimized for this task. Especially if the 130 million key case is the usual case (and not only a rare maximum).

4294967296 / 130000000 ≈ 33, so roughly every 33rd number in the whole key space appears in your data.

You could, for example, divide your key range into fixed-size partitions. If the keys are fairly evenly distributed, you should partition the key space into e.g. 256-sized buckets, or even 32-sized buckets, depending on how much storage you are willing to waste when only a few values are stored.

Example, to give you an idea:

#include <stdint.h>
#include <stddef.h>

#define BUCKET_SIZE        256
#define BUCKET_SIZE_SHIFT  8

// One inline entry per bucket; further entries whose keys fall in the
// same partition are chained via pNext. A real implementation would
// also need a way to mark the inline entry as unused (e.g. a sentinel
// key value).
struct Bucket {
  uint32_t key;
  float    value;
  Bucket*  pNext;
};

// One bucket per BUCKET_SIZE-wide slice of the 32-bit key space
// (16,777,216 buckets; note this array alone is ~256 MB with
// 64-bit pointers).
Bucket data[ 4294967296ULL / BUCKET_SIZE ];

Bucket* find( uint32_t key ) {
  uint32_t bucket_index = key >> BUCKET_SIZE_SHIFT;  // same as key / BUCKET_SIZE
  Bucket* pBucket = &data[ bucket_index ];
  while( pBucket ) {
    if( pBucket->key == key ) return pBucket;
    pBucket = pBucket->pNext;
  }
  return NULL;
}
frunsi
That looks a lot like C and not C++, but still a good idea.
tstenner
Well yes, it looks like C. Of course one will wrap this into a class. Or even into a template with value_type and bucket_size as template parameters and so on ;-)
frunsi
What's the big point in reinventing the wheel? Surely `unordered_map` exists for a reason. I'm not usually one to shout about premature optimization, but writing your own custom hash map with a custom, non-STL-compatible interface, and its own set of a few dozen bugs just because you *think* you can write faster code than the TR1 implementation? Even though 1) that might not be the case, and 2) it might not matter if this isn't a performance bottleneck?
jalf
+1  A: 

If your keys are unchanging, you might consider a perfect hash function as an alternative to a standard container.

I don't know what obstacles you'll run into with a dataset of that size, but it might be worth spending a few minutes experimenting.

Mark Ransom
+1  A: 

A vector is absolutely going to kill a map here assuming you don't need to do insertions in the middle of the vector. I wrote a custom allocator to track memory usage, and here are the results in Visual Studio 2005:

std::map<int, float>:

1.3 million insertions
Total memory allocated: 29,859 KB
Total blocks allocated: 1,274,001
Total time: 17.5 seconds

std::vector<std::pair<int, float> >:

1.3 million insertions
Total memory allocated: 12,303 KB
Total blocks allocated: 1
Total time: 0.88 seconds

std::map is using over twice the storage and taking 20 times longer to insert all the items.
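For reference, memory tracking like this can be done with a thin allocator wrapper. A rough sketch of such a counting allocator (illustrative only, not the exact code used for the numbers above):

#include <cstddef>
#include <memory>

// Global tallies updated on every allocation.
static std::size_t g_bytes_allocated  = 0;
static std::size_t g_blocks_allocated = 0;

// Thin wrapper over std::allocator that counts allocations.
template <class T>
struct CountingAllocator : std::allocator<T> {
    template <class U> struct rebind { typedef CountingAllocator<U> other; };
    CountingAllocator() {}
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        g_bytes_allocated += n * sizeof(T);
        ++g_blocks_allocated;
        return std::allocator<T>::allocate(n);
    }
};

// Usage: std::map<int, float, std::less<int>,
//                 CountingAllocator<std::pair<const int, float> > > m;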

cwick
And if `vector` is tenable, assuming the data will only be loaded once, the insertion time isn't too important. I expect that lookups into the `vector` via `lower_bound` will beat the `map` too, though the indirection through a `pair` key-extraction functor might hurt.
seh
A: 

Given the vast amount of memory involved, you must also consider that nearly every memory access during a search will result in a cache miss.

In this regard, a hybrid solution is probably best: a small hash map as the first layer, with sorted vectors for the buckets.

The idea is to keep the hash-table index resident in cache, and to search within small sorted containers to reduce the number of cache misses.
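A rough sketch of that two-level layout, assuming the top level is indexed directly by the high bits of the key (all names illustrative):

#include <algorithm>
#include <vector>

// Two-level lookup: a small top-level table (sized to stay cache-
// resident) is indexed by the high bits of the key; each slot holds
// a sorted vector that is searched by bisection.
struct TwoLevelMap {
    static const unsigned TOP_BITS = 12;  // 4096 top-level slots
    struct Entry { int key; float value; };
    struct ByKey {
        bool operator()(const Entry& e, int k) const { return e.key < k; }
    };

    std::vector<Entry> slots[1u << TOP_BITS];

    const float* find(int key) const {
        const std::vector<Entry>& v = slots[(unsigned)key >> (32 - TOP_BITS)];
        std::vector<Entry>::const_iterator it =
            std::lower_bound(v.begin(), v.end(), key, ByKey());
        if (it == v.end() || it->key != key)
            return 0;
        return &it->value;
    }
};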

am