This advice assumes modern CPUs with:
- fast caches
- memory latency that is much higher relative to the clock speed
- reasonable branch prediction (truly amazing in the latest desktop/server processors)
I would suggest that hybrid structures may well trump a single structure.
Use simple array-based key/value pairs with O(N) lookup, as mentioned, but with very low constant factors and extremely good caching behaviour. This initial structure should be small (likely no bigger than 16, and possibly 8, values) to avoid going beyond a single cache line. Regrettably, this is a parameter you would need to tune yourself.
Once you go beyond that count you would want to fall back to a structure with better asymptotic behaviour. I would suggest trying a decent hash table to start with, since it will likely be reasonable from 16 entries up to several thousand, and if you tend to look up the same values more often it will tend to stay in the faster caches.
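A minimal sketch of that hybrid, assuming 32-bit integer keys and values, a cut-off of 16 entries, and `std::unordered_map` as the secondary structure (all of which you would tune or swap for your own types):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

class HybridMap {
    static constexpr std::size_t kCutoff = 16;   // tune this yourself

    // Primary: flat array of pairs, scanned linearly. For small counts
    // this stays within a cache line or two.
    struct Entry { std::uint32_t key; std::uint32_t value; };
    Entry small_[kCutoff];
    std::size_t count_ = 0;

    // Secondary: ordinary hash table, used once we outgrow the array.
    std::unordered_map<std::uint32_t, std::uint32_t> big_;
    bool upgraded_ = false;

public:
    void insert(std::uint32_t key, std::uint32_t value) {
        if (!upgraded_) {
            for (std::size_t i = 0; i < count_; ++i)   // overwrite if present
                if (small_[i].key == key) { small_[i].value = value; return; }
            if (count_ < kCutoff) { small_[count_++] = {key, value}; return; }
            // Outgrew the array: migrate everything to the hash table.
            for (std::size_t i = 0; i < count_; ++i)
                big_.emplace(small_[i].key, small_[i].value);
            upgraded_ = true;
        }
        big_[key] = value;
    }

    std::optional<std::uint32_t> find(std::uint32_t key) const {
        if (!upgraded_) {
            for (std::size_t i = 0; i < count_; ++i)   // O(N) but cache-friendly
                if (small_[i].key == key) return small_[i].value;
            return std::nullopt;
        }
        auto it = big_.find(key);
        if (it == big_.end()) return std::nullopt;
        return it->second;
    }
};
```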
If you also remove as well as insert, you must take care not to thrash back and forth between the two states. Requiring the count to shrink to half the cut-off for 'upgrading' to the secondary structure should prevent this, but remember that any deterministic crossover behaviour will be susceptible to worst-case inputs.
This may be an issue if you are trying to defend against malicious input data; if so, adding a randomising factor to the crossover decision protects against it. You likely don't care about this, though, since you didn't mention it.
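For illustration, the downgrade test with hysteresis and an optional randomised threshold might look like this (hypothetical helpers, assuming the `kCutoff` from the sketch above):

```cpp
#include <cstddef>
#include <random>

// Downgrade only at half the upgrade threshold, so an insert/remove
// sequence hovering around the boundary cannot force repeated
// migrations between the two structures.
bool should_downgrade(std::size_t count, std::size_t cutoff) {
    return count <= cutoff / 2;
}

// Against adversarial inputs: pick a per-instance threshold at random
// from a small window so the crossover point is unpredictable.
std::size_t randomised_cutoff(std::size_t cutoff, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> jitter(0, cutoff / 4);
    return cutoff - jitter(rng);
}
```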
If you want, you might try keeping the initial primary array sorted, allowing a binary search which is O(log(N)) but at the cost of more complex search code. I would expect the simple array walk to beat it in practice, but you would want to benchmark this for differing values of N. It may allow you to stick with the primary array for longer, though I suspect that is a function of the cache line size more than of the O(N) behaviour.
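The two primary-array lookups side by side, assuming the key array is kept sorted for the binary variant; which one wins at small N is exactly the thing to benchmark:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Simple walk: branch-predictor friendly, no ordering requirement.
std::size_t linear_find(const std::uint32_t* keys, std::size_t n,
                        std::uint32_t key) {
    for (std::size_t i = 0; i < n; ++i)
        if (keys[i] == key) return i;
    return n;   // n == not found
}

// Binary search: O(log N) comparisons, but requires sorted keys and
// costs a less predictable branch pattern.
std::size_t binary_find(const std::uint32_t* keys, std::size_t n,
                        std::uint32_t key) {
    const std::uint32_t* it = std::lower_bound(keys, keys + n, key);
    return (it != keys + n && *it == key) ? std::size_t(it - keys) : n;
}
```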
Other options include:
- Treating all key values < 256 specially and storing them in a byte-key/struct-value pair of arrays, saving space on the keys (and potentially allowing them to remain in place when you switch over to the secondary structure); see the sketch after this list. This may perform badly due to the need to unpack the keys on the fly to the native word length.
- Using a trie-like structure that consumes the key a byte at a time. I doubt the added complexity will let it perform well in practice.
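A sketch of the byte-key idea from the first bullet, assuming all keys are known to fit in a byte and using an illustrative 32-bit value type:

```cpp
#include <cstddef>
#include <cstdint>

// Structure-of-arrays layout: sixteen one-byte keys pack into a
// fraction of a cache line, so the key scan touches minimal memory.
struct SmallByteKeyMap {
    static constexpr std::size_t kCap = 16;
    std::uint8_t  keys[kCap];     // all keys fit in a byte
    std::uint32_t values[kCap];   // parallel value array
    std::size_t   count = 0;

    std::uint32_t* find(std::uint8_t key) {
        for (std::size_t i = 0; i < count; ++i)
            if (keys[i] == key)    // comparison widens the byte to word
                return &values[i]; // size: the unpacking cost noted above
        return nullptr;
    }
};
```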
Once again I will repeat the very good advice from kmkaplan: benchmark it thoroughly, avoiding microbenchmarks. In this sort of analysis the real numbers can be surprisingly different from the theory...