I have ID values of type unsigned int. I need to map an ID to a pointer in constant time.


Key Distribution:

An ID will have a value in the range 0 to UINT_MAX. Most of the keys will be clustered into a single group, but there will be outliers.


Implementation:

  • I thought about using the C++ <ext/hash_map> extension, but I've heard its performance isn't too great when keys have a huge potential range.

  • I've also thought of using some form of chained lookup (equivalent to recursively subdividing the range into C chunks; see the sketch after this list). If there are no keys in a range, that range will point to NULL.

    N = Key Range

    Level 0 (divided into C = 16, so 16 pieces) = [0, N/16), [N/16, 2*(N/16)), ...

    Level 1 (divided into C = 16, so 16 * 16 pieces) = ...
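
A rough sketch of that chained lookup, to make it concrete: a fixed-depth radix trie over 32-bit keys with fanout C = 16 (4 bits per level, 8 levels). The names here (`Node`, `RadixMap`) are illustrative, and destruction/cleanup is omitted for brevity. Every lookup is the same 8 indexed loads, so the cost per key is constant; empty subranges stay NULL, so memory is only spent near the clusters of live keys.

    #include <array>
    #include <cstdint>

    struct Node {
        std::array<void*, 16> slot{};  // child Node* on inner levels, value pointers at the leaves
    };

    class RadixMap {
    public:
        void insert(uint32_t id, void* value) {
            Node* n = &root_;
            for (int shift = 28; shift > 0; shift -= 4) {
                void*& child = n->slot[(id >> shift) & 0xF];
                if (!child) child = new Node();     // subdivide on demand
                n = static_cast<Node*>(child);
            }
            n->slot[id & 0xF] = value;              // leaf level holds the pointer
        }

        void* find(uint32_t id) const {
            const Node* n = &root_;
            for (int shift = 28; shift > 0; shift -= 4) {
                const void* child = n->slot[(id >> shift) & 0xF];
                if (!child) return nullptr;         // the whole subrange is empty
                n = static_cast<const Node*>(child);
            }
            return n->slot[id & 0xF];
        }

    private:
        Node root_;
    };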


Does anyone else have ideas on how this mapping can be implemented more efficiently?

Update:

By constant, I just meant that each key lookup is not significantly influenced by the number of values in the container. I did not mean it had to be a single operation.

+11  A: 

Use a hash map (unordered_map). This gives ~O(1) look-up times. You "heard" it was bad, but did you try it, test it, and determine it to be a problem? If not, use a hash map.

After your code gets close to completion, profile it and determine if the look-up times are the main cause of slowness in your program. Chances are, it won't be.
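
A minimal sketch of what that looks like (`Item` here is a stand-in for whatever the pointers refer to):

    #include <unordered_map>

    struct Item;

    std::unordered_map<unsigned int, Item*> table;

    void put(unsigned int id, Item* p) {
        table[id] = p;                               // average O(1) insert
    }

    Item* get(unsigned int id) {
        auto it = table.find(id);                    // average O(1) lookup
        return it != table.end() ? it->second : nullptr;
    }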

GMan
+1  A: 

You're not going to get constant time.

I'd probably use a B+Tree.

Tom
A hash map is constant time, most of the time.
GMan
@GMan: It depends on the hash and the keys.
kibibu
And the number of buckets
kibibu
Yup...? And most of the time it'll be very good.
GMan
It's still amortized O(1) on a statistically random distribution of keys.
Pavel Minaev
I still feel that saying "it's amortized O(1)" is a great way to hide true performance. It could potentially be as bad as O(N^2) for N objects (even though that's very unlikely), but hey, it's O(1) amortized! This also assumes a constant-time object hashing method, which may not always be true.
ttvd
+1  A: 

If your integer values are 32 bits wide, then you could use a 64-bit platform, allocate 32 gigabytes of memory (8 bytes for each of 4 billion pointers), and use a flat array. That will be as close as you're going to get to constant lookup time.
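
Concretely, the idea is a single giant table indexed by the id itself. This is a sketch only: it really does allocate 32 GB, so it only runs on a machine provisioned for it, and `Item` is a stand-in type.

    #include <cstdint>
    #include <vector>

    struct Item;

    // One slot per possible 32-bit id: 2^32 pointers = 32 GB on a 64-bit build.
    std::vector<Item*> table(uint64_t(1) << 32, nullptr);

    Item* get(uint32_t id) {
        return table[id];   // a single index, no search at all
    }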

Greg Hewgill
Side note: The fact that this is even *possible* today is pretty mind-boggling for those of us who grew up in an era where 64 kilobytes was a fully-decked-out machine.
Greg Hewgill
By constant, I just meant that each key lookup is not significantly influenced by the number of values in the container. I did not mean it had to be a single operation.
jameszhao00
Constant time means that, irrespective of the number of values, you get the same time; what this comment describes is linear time.
Tom
Whole new meaning to "perfect hashing" :)
ttvd
Can you explain why this is linear? I'm expecting each get() op to take a near-constant number of instructions (for example, the chained lookup proposal I posted in the question).
jameszhao00
@Tom: "Linear time" is a situation where the lookup time depends directly on the number of items in the container. My suggestion doesn't do a search at all, but a single index: `a[id]` where `id` is the identifier to look up and `a` is the big array.
Greg Hewgill
@Greg: can you spell “cache miss”? There goes your constant time, out of the window … ;-)
Konrad Rudolph
@Konrad: I was wondering whether anybody would pick up on that! The OP did say "Most keys will be clustered into a single group"... :)
Greg Hewgill
Oh well, you win.
Konrad Rudolph
+1  A: 

Reserve 4GB of your RAM for this, and simply cast your uint to the pointer. That's definitely constant time.

Zed
+4  A: 

If you want a tree-based solution and your ids are in the range {0..n-1}, then you can use a very cool data structure called a van Emde Boas tree. This will yield all operations in O(log log n) time and use O(n) space.
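
For a flavor of how it works, here is a minimal, hedged sketch of a vEB tree (insert and member only; clusters allocated lazily; `VebTree` and its layout are illustrative, not a library type):

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <utility>
    #include <vector>

    // Universe [0, 2^k), k >= 1; assumes no duplicate inserts.
    // Deletion and successor queries are omitted for brevity.
    class VebTree {
    public:
        explicit VebTree(int k) : k_(k) {
            if (k_ > 1)
                clusters_.resize(std::size_t(1) << (k_ - k_ / 2));
        }

        void insert(int64_t x) {
            if (min_ < 0) { min_ = max_ = x; return; }  // first key in this node
            if (x < min_) std::swap(x, min_);           // the min is not stored recursively
            if (k_ > 1) {
                int64_t h = high(x), l = low(x);
                if (!clusters_[h])
                    clusters_[h] = std::make_unique<VebTree>(k_ / 2);
                if (clusters_[h]->min_ < 0) {           // cluster was empty:
                    if (!summary_)
                        summary_ = std::make_unique<VebTree>(k_ - k_ / 2);
                    summary_->insert(h);                // record it as non-empty
                }
                clusters_[h]->insert(l);
            }
            if (x > max_) max_ = x;
        }

        bool member(int64_t x) const {
            if (x == min_ || x == max_) return true;
            if (k_ <= 1) return false;
            const VebTree* c = clusters_[high(x)].get();
            return c && c->member(low(x));
        }

    private:
        int64_t high(int64_t x) const { return x >> (k_ / 2); }
        int64_t low(int64_t x) const  { return x & ((int64_t(1) << (k_ / 2)) - 1); }

        int k_;
        int64_t min_ = -1, max_ = -1;
        std::unique_ptr<VebTree> summary_;
        std::vector<std::unique_ptr<VebTree>> clusters_;
    };

For 32-bit ids, k = 32 and the recursion depth is only five levels (32 → 16 → 8 → 4 → 2), which is where the O(log log u) bound comes from.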

ttvd
Hey, that is cool
kibibu
In my experience very painful to implement :) But yeah it's very impressive.
ttvd
+1  A: 

As GMan suggests, an unordered_map is probably a good solution. If you are concerned about a large number of collisions in this hash map, then use a hash function that will remove the clustering of your data. For example, you could swap the bytes around.
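
A sketch of that byte-swapping idea as a custom hash for `unordered_map`; whether it actually helps depends on your standard library's bucket mapping, so measure before committing to it (`Item` is again a stand-in type):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    struct Item;

    // Reverse the byte order so ids that differ only in the low bytes
    // land in distant buckets.
    struct ByteSwapHash {
        std::size_t operator()(uint32_t id) const {
            return (id >> 24) | ((id >> 8) & 0xFF00u)
                 | ((id << 8) & 0xFF0000u) | (id << 24);
        }
    };

    std::unordered_map<uint32_t, Item*, ByteSwapHash> table;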

A good point to note is that you will probably spend more time debugging and proving a custom data structure than you would spend on one that already has a good pedigree.

PinkTriangles
+1  A: 

How many items are to be in such a map and how often is it changed?

If all values fit into the processor's cache, then a std::vector<std::pair<unsigned int, T*>> with presorted values and a binary search might be fastest, despite the access being O(log N).
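
A sketch of that approach, with `Item` standing in for the pointed-to type: build the table, sort it once, then look ids up with `std::lower_bound`.

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct Item;

    std::vector<std::pair<unsigned int, Item*>> table;

    void finalize() {
        // Sort by key once, after all inserts (assumes unique keys).
        std::sort(table.begin(), table.end(),
                  [](const std::pair<unsigned int, Item*>& a,
                     const std::pair<unsigned int, Item*>& b) {
                      return a.first < b.first;
                  });
    }

    Item* get(unsigned int id) {
        auto it = std::lower_bound(
            table.begin(), table.end(), id,
            [](const std::pair<unsigned int, Item*>& e, unsigned int key) {
                return e.first < key;             // binary search on the key only
            });
        return (it != table.end() && it->first == id) ? it->second : nullptr;
    }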

sbi
Around 200k items will be in the lookup.
jameszhao00
With a 32-bit `int` and a 32-bit pointer, that would be 1.6MB. I have no experience with this, but before I'd go and implement something like a vEB tree, I'd pick some integers that hash very well and try to find out how a sorted `std::vector` with a binary search compares to `std::unordered_map` performance-wise.
sbi