Here are some constraints for a data structure I need. It seems like none of the common data structures (I will mention the ones I've thought of below) fit these all that well. Can anyone suggest one that I maybe haven't thought of?

  1. I need to be able to perform lookups by unsigned integer keys.
  2. The items to be stored are user-defined structs.
  3. These indices will be sparse, usually extremely so. Regular arrays are out.
  4. The frequency of each index will have a non-uniform distribution, with small indices being much more frequent than large indices.
  5. N will usually be small, probably no larger than 5 or 10, but I don't want to rely on that too heavily because it might occasionally be much larger.
  6. The constant term matters a lot. I need really fast lookups when N is small. I already tried generic hash tables and, empirically, they are too slow, even when N=1, meaning no collisions, probably because of the amount of indirection involved. However, I'd be open to suggestions about specialized hash tables that take advantage of other constraints mentioned.
  7. Insertion time is not important as long as retrieval time is fast. Even O(N) insertion time is good enough.
  8. Space efficiency is not terribly important, though it is important enough not to just use regular arrays.
+4  A: 

When N is small, a simple array or singly linked list with key + value as the payload is very efficient, even if it is not the best choice when N gets larger.

You get O(N) lookup time, which means lookups take k * N time. An O(1) lookup takes a constant time K. So the O(N) structure performs better whenever N < K/k. Here k is very small, so that covers interesting values of N. Remember that Big O notation only describes behavior for large N, which is not what you are after. For small tables

void *lookup(int key_to_lookup)
{
  /* linear scan; assumes key_to_lookup is known to be present
     in table_key[], otherwise the loop runs off the end */
  int n = 0;
  while (table_key[n] != key_to_lookup)
    n++;
  return table_data[n];
}

can be hard to beat.

Benchmark your hash tables, balanced tree and simple array/linked list and see at which values of N they each start to be better. Then you will know which is better for you.

I almost forgot: keep the frequently accessed keys at the beginning of your array. Given your description that means keep it sorted.
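
For completeness, a matching sorted insert could look something like the sketch below (table_size is a made-up counter of used entries; capacity checks are left out). Keeping the keys in ascending order puts the small, frequent keys at the front, where the scan above finds them first.

void insert(int key, void *data)
{
  int n = table_size;                        /* table_size: hypothetical count of used entries */
  while (n > 0 && table_key[n - 1] > key) {  /* shift larger keys one slot to the right */
    table_key[n]  = table_key[n - 1];
    table_data[n] = table_data[n - 1];
    n--;
  }
  table_key[n]  = key;
  table_data[n] = data;
  table_size++;
}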

kmkaplan
+1  A: 

You might try to combine the best of both worlds: If the key is small, put it into an array-like data structure that does not grow larger than a pre-defined maximum key. If the key is large, put it into a hashtable.
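
A minimal sketch of that split (all names and sizes are mine, and the large-key side is a small fixed-size open-addressed table standing in for a real hashtable; assume the struct is zero-initialized, e.g. struct hybrid h = {0};):

#include <stddef.h>

#define SMALL_MAX   256     /* keys below this go straight into the array */
#define BIG_BUCKETS 64      /* power of two; fallback table, growth omitted */

struct entry { unsigned key; void *value; int used; };

struct hybrid {
    void        *small[SMALL_MAX];   /* direct-indexed; NULL = empty        */
    struct entry big[BIG_BUCKETS];   /* open addressing with linear probing */
};

static void hybrid_put(struct hybrid *h, unsigned key, void *value)
{
    if (key < SMALL_MAX) { h->small[key] = value; return; }
    size_t i = key & (BIG_BUCKETS - 1);
    while (h->big[i].used && h->big[i].key != key)
        i = (i + 1) & (BIG_BUCKETS - 1);     /* assumes the table never fills */
    h->big[i].key = key; h->big[i].value = value; h->big[i].used = 1;
}

static void *hybrid_get(const struct hybrid *h, unsigned key)
{
    if (key < SMALL_MAX) return h->small[key];   /* one bounds check + one index */
    size_t i = key & (BIG_BUCKETS - 1);
    while (h->big[i].used) {
        if (h->big[i].key == key) return h->big[i].value;
        i = (i + 1) & (BIG_BUCKETS - 1);
    }
    return NULL;
}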

Timbo
+2  A: 

A hash table lookup is about as fast as it CAN be:

The only thing that distinguishes it from a regular array lookup is the computation of the hash; if your hash function is good enough (or you spend enough time generating an optimal hash function during insertion, which would make your insertion take O(N)), what remains is essentially an array lookup.

I say "essentially" because it might happen (unless you use an optimal hash function) that you have to rehash or follow a very small linked list.

Since most hash functions used for hash tables are of the form k*c_1 % c_2, the difference from an array lookup in a rather sparse and/or optimal hash table consists of one indirection, two multiplications, a subtraction and a division (an efficient modulo implementation using the CPU's capabilities might reduce that by a subtraction and a multiplication), plus the array lookup itself.

It simply does not get faster than that.
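
To make the cost concrete, a hypothetical lookup of that shape (C1 and C2 are just example constants standing in for c_1 and c_2 above):

#define C1 2654435761u   /* example multiplicative constant, my choice */
#define C2 97u           /* example table size used as the modulus     */

static unsigned bucket_of(unsigned key)
{
    return (key * C1) % C2;   /* one multiply, then the modulo (div, mul, sub) */
}

/* the lookup itself is then table[bucket_of(key)]: the same single
   indirection a plain array lookup needs, plus the arithmetic above */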

dionadar
Sorry, but he specifically mentioned constant factors, and hash tables can actually be very slow on that front: they require checking both the hash and the equality, degrade badly if the hash is poor, allow no use of domain-specific knowledge (which is present here), and can have poor cache behaviour.
ShuggyCoUk
"degrade badly if the hash is poor, allow no use of domain specific knowledge" If the hash is poor in a circumstance where you know to be heavily using a small subset of unsigned integers - then you probably chose the wrong hash?
dionadar
Designing a good hash function for specific data distributions is hard (let's go shopping); the normal avalanching/chi-square testing is not such a great indicator since the distribution domain is so tight.
ShuggyCoUk
Oh and hash tables with reasonable load factors always waste some space. In small collections that can be the difference between filling one cache line or two. This can overwhelm performance numbers...
ShuggyCoUk
A: 

I'd consider a hashtable that handles hash collisions with a self-balancing binary tree instead of simple chaining. You should be able to get O(1) amortized lookup over all keys and a worst-case lookup of O(log N). Since your key distribution is skewed, it's likely that you'll have collisions at low values of the index, and the tree lookup will really pay off there.
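
A lookup-side sketch of that layout (names are mine; keeping the per-bucket trees balanced on insert with AVL or red-black rotations is omitted):

#include <stddef.h>

#define NBUCKETS 64   /* power of two, example size */

struct node { unsigned key; void *value; struct node *left, *right; };

struct tree_hash { struct node *bucket[NBUCKETS]; };

static void *tree_hash_get(const struct tree_hash *h, unsigned key)
{
    const struct node *n = h->bucket[key & (NBUCKETS - 1)];
    while (n) {                               /* O(log N) if the tree stays balanced */
        if (key == n->key) return n->value;
        n = (key < n->key) ? n->left : n->right;
    }
    return NULL;
}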

tvanfosson
Why should it be likely? At least with the TR1 hash maps you can specify your hash function?
dionadar
Since they are integer keys I was assuming a naive hash function (like modulo). Using a more complex hash function could solve the problem.
tvanfosson
My assumption was also based on his stated experience with the hashtable.
tvanfosson
If he is using a distribution where low values dominate, a very simple function like k % (mapsize*2) should actually perform very well? This would essentially turn the hashmap into array indexing, where high key values are folded down into the standard range.
dionadar
+3  A: 

This advice assumes modern cpus with:

  • fast caches
  • much slower memory latency compared to the clock speed.
  • reasonable branch prediction (truly amazing in the latest desktop/server processors)

I would suggest that hybrid structures may well trump a single structure.

Use simple array-based key/value pairs with O(N) access as mentioned, but with very low constant factors and extremely good caching behaviour. This initial structure should be small (likely no bigger than 16, possibly 8, values) to avoid going beyond a single cache line. Regrettably, the exact size is a parameter you would need to tune yourself.

Once you go beyond that number you would want to fall back to a structure with better asymptotic behaviour. I would suggest trying a decent hash table to start with, since it will likely be reasonable from 16 up to several thousand entries, and if you tend to look up similar values more often it will tend to stay in the faster caches.

If you also remove as well as insert, you must take care not to thrash back and forth between the two states. Requiring the count to shrink to one half the cut-off for 'upgrading' to the secondary structure should prevent this, but remember that any deterministic cross-over behaviour will be susceptible to worst-case inputs.
This may be an issue if you are trying to defend against malicious input data. If so, using a randomising factor in the decision protects against it. It is likely you don't care about this though, since you didn't mention it.
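
A tiny sketch of that hysteresis (the cut-off value is made up and would need tuning):

#define UPGRADE_AT 16   /* hypothetical cut-off, tune against your cache line */

enum repr { SMALL_ARRAY, HASH_TABLE };

static enum repr next_repr(enum repr current, unsigned count)
{
    if (current == SMALL_ARRAY && count > UPGRADE_AT)
        return HASH_TABLE;                    /* grew past the cut-off: upgrade    */
    if (current == HASH_TABLE && count <= UPGRADE_AT / 2)
        return SMALL_ARRAY;                   /* shrank to half before downgrading */
    return current;                           /* otherwise stay put: no thrashing  */
}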

If you want, you might try making the initial primary array sorted, allowing for a binary search which is O(log(N)), but at the cost of more complex search code. I would think that the simple array walk will actually beat it, but you would want to benchmark this for differing values of N. It may allow you to stick with the primary array for longer, but I would think this is a function of the cache line size more than of the O(N) behaviour.

Other options include:

  • Treating all key values < 256 differently and storing them in a byte -> struct pair of arrays, saving space on the keys (and potentially allowing them to remain there when you switch over to the secondary structure). This may perform badly due to the need to unpack the array on the fly to the native word length.
  • Using a trie-like structure that consumes the key a byte at a time. I doubt this will perform well in practice, given the added complexity.

Once again I will repeat the very good advice from kmkaplan. Benchmark it thoroughly avoiding microbenchmarks. In this sort of analysis the real numbers can be surprisingly different from the theory...

ShuggyCoUk
A: 

You could try an open-addressed hash with quadratic probing instead of separate chaining, if your N is usually small. You'd need to reallocate from, say, an initial size of 32 to larger widths if you get the rare N case that overfills it. Linear probing or cuckoo hashing will give you good performance if you can get the whole structure to fit in a few cache lines.

Honestly, I'm surprised that even a standard hash table is giving you such miserable performance. Maybe you could profile it to see just what's making it so slow -- if it's the hash function itself, use a simple one like a power-of-two modulus (e.g., key & (N-1) where N is known to be 2^x), which will favor distributions centered around 0 anyway. If it's the dcache miss of chasing the separate chain, write an implementation that stores the first four elements in each bucket in the bucket itself, so that you at least get them quickly. How slow is N=1?

I'd store pointers to structs rather than the structs themselves in the bucket chains: if the structs are large, then walking a chain of them will have many cache misses. On the other hand, you can fit about 16 key/pointer pairs on a single cache line, and only pay for a miss when you find the correct element.
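
A sketch of the open-addressed, quadratic-probing lookup with the power-of-two mask (sizes and names are mine; growing the table when it overfills is left out):

#include <stddef.h>

#define SIZE 32   /* power of two, e.g. the initial width mentioned above */

struct slot { unsigned key; void *value; int used; };

static void *oa_get(const struct slot table[SIZE], unsigned key)
{
    size_t i = key & (SIZE - 1);               /* cheap power-of-two "modulus" */
    for (size_t probe = 1; probe <= SIZE; probe++) {
        if (!table[i].used)      return NULL;  /* empty slot: key is absent    */
        if (table[i].key == key) return table[i].value;
        i = (i + probe) & (SIZE - 1);          /* triangular steps +1, +2, +3 ...
                                                  visit every slot when SIZE is
                                                  a power of two               */
    }
    return NULL;                               /* table full and key absent    */
}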

Crashworks
+1  A: 

The only explanation I can see for the described problem is that the hash function is way too complex. I would be inclined to take a two-stage approach:

1) For small keys, a simple array of pointers. No hash or anything.

2) For keys that are bigger than the size of the table you allocate:

How about a very simple hash function that will spread out the clustered keys:

The high-order 5 bits (I'm assuming 32-bit integers; if it's 64-bit then add one more bit) hold the number of bits that actually contain data; the rest is simply the original key cut into chunks of however many bits you are using for the purpose, with the chunks added together (discarding carries).

Note that the number of significant bits can be partially precalculated: build a 64k table of high-bit values. If the high-order word is non-zero, use it as an index into the table and add 16, otherwise use the low-order word as the index. For 64-bit integers you obviously have to use four steps instead of two.
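
A sketch of just that precalculation for 32-bit keys (table and function names are mine; the table stores the bit count for each 16-bit value and must be built once before use):

#include <stdint.h>

static uint8_t sig_bits16[65536];   /* sig_bits16[v] = significant bits in v; [0] stays 0 */

static void build_sig_bits_table(void)
{
    for (unsigned v = 1; v < 65536; v++)
        sig_bits16[v] = (uint8_t)(sig_bits16[v >> 1] + 1);   /* one more bit than v>>1 */
}

static unsigned significant_bits(uint32_t key)
{
    uint16_t hi = (uint16_t)(key >> 16);
    return hi ? sig_bits16[hi] + 16u          /* high word non-zero: index it, add 16 */
              : sig_bits16[(uint16_t)key];    /* otherwise index with the low word    */
}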

Loren Pechtel
A: 

You could consider Judy Arrays:

Judy is a C library that provides a state-of-the-art core technology that implements a sparse dynamic array. Judy arrays are declared simply with a null pointer. A Judy array consumes memory only when it is populated, yet can grow to take advantage of all available memory if desired ... Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions.

Jacob Gabrielson
Note that these are rather dependent on certain language features for full utility, notably pointers (the empty Judy array is a null pointer).
ShuggyCoUk
Also, I have yet to find any *recent* serious performance analysis on x86 CPUs; the original tuning was done on HP CPUs with somewhat different cache implementations. Older analysis: http://www.nothings.org/computer/judy/
ShuggyCoUk
no idea why you got a -1, here's a plus to offset
ShuggyCoUk
BTW, I agree that Judy arrays might not work as well as advertised, but they do (purport to) satisfy many of the requirements, so are probably worth a try if nothing else works.
Jacob Gabrielson
Judy is tuned for modern CPUs, see http://ub0.cc/nQ/kX.
ArtemGr
A: 

Here's a general idea for a hashing function. You said inserts can be costly.

Hash the key, which is an integer, with a simple modulus that is stored with each instance of the hash table.

If an insert would cause a collision, re-optimize your hash table by computing the number of collisions that would occur for each modulus in a reasonable range, say, from the number of elements in your map up to some constant multiple of that.

Obviously your inserts become quite expensive, around O(n^2) if you minimize allocations, but you will probably be able to achieve lookups with a single integer division and a single pointer indirection, and you know, because you computed it at insert time, what the worst-case lookup will be.
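
A sketch of the re-optimizing step (names and the 4x range constant are mine; rebuilding the actual table with the chosen modulus, and allocation-failure handling, are left out):

#include <stdlib.h>

static unsigned best_modulus(const unsigned *keys, unsigned n)
{
    if (n == 0) return 1;
    unsigned best_m = n, best_collisions = n + 1;
    for (unsigned m = n; m <= 4 * n; m++) {     /* "some constant multiple" of n */
        unsigned char *seen = calloc(m, 1);
        unsigned collisions = 0;
        for (unsigned i = 0; i < n; i++) {
            unsigned b = keys[i] % m;
            if (seen[b]) collisions++;          /* key lands in an already-used bucket */
            seen[b] = 1;
        }
        free(seen);
        if (collisions < best_collisions) {
            best_collisions = collisions;
            best_m = m;
            if (collisions == 0) break;         /* perfect for the current key set */
        }
    }
    return best_m;
}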

TokenMacGuy
A: 

I would recommend a skip list here. The java.util.concurrent package has a good implementation (ConcurrentSkipListMap) if you're into that.

Overflown