views: 259

answers: 2

Given that .NET has the ability to detect bitness via IntPtr (looking through Reflector, a good amount of it is marked unsafe, though, which is a shame), I've been thinking that GetHashCode returning an int is potentially short-sighted.

I know that ultimately, with a good hashing algorithm, the billions of values offered by Int32 are absolutely adequate, but even so, the narrower the possible set of hashes, the slower hashed key lookups become, since more collisions mean more linear searching within buckets.
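
To illustrate (a contrived sketch - BadKey is just a made-up type, not anything in the framework): if every key produces the same hash, Dictionary degenerates into a linear scan of a single bucket, calling Equals on every candidate:

using System;
using System.Collections.Generic;

// Hypothetical key type whose hash space has collapsed to a single value.
struct BadKey : IEquatable<BadKey>
{
    public long Value;
    public bool Equals(BadKey other) { return Value == other.Value; }
    public override bool Equals(object obj) { return obj is BadKey && Equals((BadKey)obj); }
    public override int GetHashCode() { return 0; } // zero useful bits of hash: every key collides
}

class CollisionDemo
{
    static void Main()
    {
        var map = new Dictionary<BadKey, string>();
        // Every insert and lookup now scans one ever-growing bucket,
        // comparing keys with Equals - O(n) instead of O(1).
        for (long i = 0; i < 10000; i++)
            map[new BadKey { Value = i }] = i.ToString();
    }
}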

Equally - am I the only one who finds this amusing:

struct Int64 {
  public override int GetHashCode()
  {
    return (((int) this) ^ ((int) (this >> 0x20)));
  }
}

Whilst Int32's implementation simply returns this.

If IntPtr is out of the question because of performance concerns, perhaps an IHashCode interface that implements IEquatable etc. would be better?

As our platforms get larger and larger in terms of memory capacity, disk size etc., surely the days of 32-bit hashes being enough are numbered?

Or is it simply the case that the overhead involved in either abstracting out the hash via interfaces, or adapting the size of the hash to the platform, outweighs any potential performance benefits?

+5  A: 

The Int64 hash function is there to make sure that all the bits are considered - so basically it is XORing the top 32 bits with the bottom 32 bits. I can't really imagine a better general-purpose one. (Truncating to Int32 would be no good - how could you then properly hash 64-bit values which had all zeros in the lower 32 bits?)
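
A sketch of the difference (the helper names here are mine, not the BCL's) - folding keeps the high bits in play, while truncation guarantees collisions for values that differ only above bit 31:

using System;

static class HashDemo
{
    // Fold both halves of the 64-bit value into 32 bits, as Int64.GetHashCode does.
    static int FoldHash(long value)
    {
        return unchecked((int)value ^ (int)(value >> 32));
    }

    // Naive alternative: just keep the low 32 bits.
    static int TruncateHash(long value)
    {
        return unchecked((int)value);
    }

    static void Main()
    {
        long a = 0x0000000100000000; // differs from b only in the high 32 bits
        long b = 0x0000000200000000;

        Console.WriteLine(TruncateHash(a) == TruncateHash(b)); // True  - guaranteed collision
        Console.WriteLine(FoldHash(a) == FoldHash(b));         // False - the high bits still contribute
    }
}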

If IntPtr were used as the hash return value, then code would have to have conditional branches (is it 32-bit? is it 64-bit? etc), which would slow down the hash functions, defeating the whole point.
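
Roughly the kind of per-call check that implies (just a sketch of the idea, not anything the framework actually does):

using System;

static class PlatformHash
{
    // Hypothetical platform-sized hash: the branch on IntPtr.Size runs on every call,
    // which is exactly the overhead described above.
    static IntPtr GetPlatformHashCode(long value)
    {
        if (IntPtr.Size == 8)
            return new IntPtr(value);                                      // 64-bit process: keep all the bits
        else
            return new IntPtr(unchecked((int)value ^ (int)(value >> 32))); // 32-bit process: fold down
    }

    static void Main()
    {
        Console.WriteLine(GetPlatformHashCode(0x123456789ABCDEF0));
    }
}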

I would say that if you have a hashtable which actually has 2 billion buckets, you're probably at the stage of writing an entire custom system anyway. (Possibly a database would be a better choice?) At that size, making sure the buckets were filled evenly would be a more pressing concern. (In other words, a better hash function would probably pay more dividends than a larger number of buckets).

There would be nothing to stop you implementing a base class which did have an equivalent 64-bit hash function, if you did want a multi-gigabyte map in memory. You'd have to write your own Dictionary equivalent however.
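
A bare-bones sketch of what that could look like (IHashCode64 and Dictionary64 are invented names - nothing like this exists in the BCL, and real CLR arrays are still bound by the runtime's object-size limits):

using System;
using System.Collections.Generic;

// Hypothetical 64-bit hashing contract, along the lines the question suggests.
interface IHashCode64<T> : IEquatable<T>
{
    long GetHashCode64();
}

// Minimal Dictionary-style container keyed on the 64-bit hash.
// Only bucket selection, insert and lookup are shown; growth, removal etc. are omitted.
class Dictionary64<TKey, TValue> where TKey : IHashCode64<TKey>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public Dictionary64(long bucketCount)
    {
        buckets = new List<KeyValuePair<TKey, TValue>>[bucketCount];
    }

    private List<KeyValuePair<TKey, TValue>> BucketFor(TKey key)
    {
        // Reduce the full 64-bit hash to an index into the bucket array.
        long index = (key.GetHashCode64() & long.MaxValue) % buckets.LongLength;
        return buckets[index] ?? (buckets[index] = new List<KeyValuePair<TKey, TValue>>());
    }

    public void Add(TKey key, TValue value)
    {
        BucketFor(key).Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        foreach (var pair in BucketFor(key))
        {
            if (pair.Key.Equals(key)) { value = pair.Value; return true; }
        }
        value = default(TValue);
        return false;
    }
}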

stusmith
+1 for pragmatism
kdgregory
Yes, I understand that XORing them together makes sure all the bits get considered - that makes a lot of sense. Interestingly, if you look at IntPtr - used for things like method handles - it simply truncates to an int. That's great if you've got handles to data in the upper 32 bits of memory and you're using them as keys! I take your point about conditional branching - you couldn't make the 32-bit/64-bit hash transparent to the code that generates it. I also take your point about writing a new data structure to store more data - which I guess is where you would ultimately have to take it.
Andras Zoltan
+3  A: 

You do realize that the hash code returned by GetHashCode is used for addressing into a hash table? Using a bigger data type would be a futile exercise, since actual hash tables are far smaller than the 32-bit hash space anyway. The additional information would simply be wasted because it cannot be used adequately.

Common hash tables have on the order of a few thousand to a few million entries. A 32-bit integer is more than sufficient to cover this range of indices.
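
Concretely, the reduction a typical hash table performs looks something like this (a sketch of the idea, not the actual Dictionary source) - the 32-bit hash is immediately squashed down to a bucket index far smaller than 2^32, so wider hashes would buy nothing:

using System;

class BucketDemo
{
    static void Main()
    {
        // Roughly a million buckets - about as large as common in-memory tables get.
        int bucketCount = 1000003;

        int hash = "some key".GetHashCode();

        // Clear the sign bit, then take the remainder; everything beyond the
        // bucket count is discarded at this point.
        int bucket = (hash & 0x7FFFFFFF) % bucketCount;

        Console.WriteLine(bucket); // always in [0, bucketCount)
    }
}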

Konrad Rudolph
Well - that's not quite true - a hash code of 2,034,242,111 doesn't get used as an index. Unlike an array, there is nothing other than memory that restricts a hashtable's size - theoretically there's no reason it couldn't have 10 billion elements, even with a 32-bit hash. Bring on a machine with a couple of hundred gigs of RAM (okay, let's say a terabyte) and we could fill it with such a huge hashtable. Whether you would - or would create some other structure instead - is another story!
Andras Zoltan
@Andras: how is that any different from a normal array (hint: it isn't)? And yes, you *could* have 10 billion elements - just as with a normal array - but that simply doesn't scale on *any* current architecture. Complicating the whole .NET architecture for the *one* machine worldwide that can handle 1 TB of main memory doesn't sound like a good trade-off to me. The point is: architectures necessarily involve trade-offs, and doubling the size of an address may be a big deal.
Konrad Rudolph