views:

86

answers:

3

I'm using a hash table (DotNET Dictionary object) as part of a sparse two-dimensional data set. Most of the entries in the hash table will be close together. I'll probably end up with 100 ~ 10,000 entries, all of them clustered near zero. I've read that a Hash table performs better when the hashes are spread across the entire integer(32 bit) range.

Is there a cheap way to map consecutive integers onto wildly different values in a 1:1 fashion? I don't have to map them back, it's purely a one-way thing.

+1  A: 

If you know the maximum value of your keyset (kmax), you could expand by a constant factor (multiplier), say multiply by a fixed prime number that keeps the product below the max integer size (2^31 - 1):

i.e. the nearest prime number to (2^30) / kmax

Note: make sure the prime used is not the same as the number of buckets in the Hash table.

Here is another solution: Since the .NET Random class will generate the same value for the same seed, you could use that to distribute the incoming keys.

Mitch Wheat
Interesting solution. I can be reasonably sure that integers will remain low, but this class is going into an SDK so I hesitate to make this a hard constraint.
David Rutten
+1  A: 

Instead of using Integer, write a class that Inherits from Integer, and override the GetHashCode function. This way you don't have to do anything but create this function!

The easiest way I can think of to spread out the values evenly is to do something like:

public class MyInteger:Integer
{
    public override int GetHashCode()
    {
       unchecked
       {
           return (int)Math.Pow(this,this);
       }
    }
}

Nice and evenly split up, while keeping the effort to a minimum.

Erich
@Erich, thanks, but is this guaranteed to provide a unique mapping of every single possible integer?
David Rutten
It won't, but assuming they are reasonably clustered they will be. Also, you don't NEED a unique mapping the way a hashtable works. They will be spread out enough that your speed will be quite fast.Other than the range, you might be better off just using an Array, with an int-index. Each empty spot only takes 4 bytes, so it won't be terribly large. You mention it is 10,000 as the highest value, so that is only 40,000 bytes, or 40k for the whole array, which will have an O(1) search and insert time.if you are willing to give up a k of memory, it would be best done that way.
Erich
Good point, I don't mind spending a lot of memory on this since it will only need that memory very briefly. I'll try this approach as well and see if it outperforms the hash-table way. Thanks!
David Rutten
+2  A: 

Maybe I'm misunderstanding what you are saying, but Dictionary will already hash your integers. There shouldn't be any need to pre-hash them. Why not try out the default implementation and see how it goes instead of attempting a pre-optimizion that will in all likelihood be pointless.

dangph
That is a good point!
Mitch Wheat
If you disassemble the Int32 type, the hash code is merely the number itself. Hash tables work better if the hash values are spread across the entire range. You're right of course, I should try both and see if it makes a difference at all, but for me to try it both ways, I need a way to remap the integers.
David Rutten
@David, fair enough. There are some hash functions here: http://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key
dangph