views:

165

answers:

5

I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?

In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?

+1  A: 

The best hash keys are those that

  1. Have good (as in low collisions) hashes (see Object.GetHashCode for .NET, Object.hashcode for Java)
  2. Have quick comparisons (for when there are hash collisions).

All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.

C. Ross
+3  A: 

As long as a suitable hash function is provided all types will do as keys. Remember after all a hash table is just a linear array. The hash function takes a key of a certain type and computes an index in the hash table array (called bucket) where the value gets stored (there are some issues with collisions though).

So the real tricky part is finding a hash function. Of course it should have certain properties, like being simple to compute, chaotic (nearly identical keys should be mapped to completly different hash table buckets), deterministic (same keys means same hash table bucket), uniformity (all possible keys are mapped evenly to the buckets), or surjectivity (all buckets of the hash table should be used).

It seems it is easier to define such a function for simple types like integers.

spa
wrong ! the real issue is key mutability !
Gyom
That is indeed true. However it is a definiton what keys are considered equal and which are not.
spa
+4  A: 

Most string implementations, while they might appear as references types in managed environments their implementation is typically an immutable type.

What the hash function does is that it maps a very large number of states onto a smaller number of states.

That is why string hashing is good for testing string equality. You can map the value to an index of an array, and look up some information about that value very quickly. You don't need to compare every character with every other character in every other string. And you can say just about the same thing about anything. It's all about reducing, or fingerprinting an arbitrary number of bytes in some manner which is useful.

This is where the discussion about the type of key you use in a hash table becomes invalid, because it's the mapping of that value into a smaller state space and how that's utilized internally which makes it useful. An integer is typically hardware friendly, but 32-bits isn't really a large space and collisions are likely within that space for arbitrary inputs.

In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming that your hash table contains hundreds of items).

John Leidegren
I do understand that a hash function works by mapping a (potentially) large value to a smaller space, but doesn't the speed of a hash function also depend on the size of its input? That's kind of why I assumed it's typically discouraged to use large reference types as keys. If that is not the case, though, then I wonder why this would be discouraged after all (maybe it's because the developer needs to the implement his/her own efficient hash function?).
Dan Tao
As I said, many managed environments implement strings as immutable types. And when you have an immutable type, the hash code doesn't need to be computed every time because the value is constant (once created). Typically, you would only have to pay the cost of producing the hash code, once, for every unique string. e.g. the .NET runtime maintains an internal string pool to accomplish this. However, the cost of producing a hash code from an unknown string is there, but the cost is related to the length of the string used as key not the size of the collection (or hash table).
John Leidegren
This is quite counterintuitive to me. Are you saying that, if I add an item to a HashTable, then later go to retrieve that item by key, the runtime will magically know the hash code for that key without having to perform the hash function? How can this be?
Dan Tao
yeah, well, obviously not but something like that. It's not magic but it's a side-effect on how strings are treated by the runtime. The idea is that all identical string values are only represented once in memory. The runtime is still paying to cost of maintaining the string pool but you never end up paying the cost of hashing a string more than once.
John Leidegren
+1 Detailed explanation, and responding to user comments.
C. Ross
So you're saying the runtime maintains its own string-like objects internally, each of which has a cached value for its hash code (right?)... Here's what I *still* don't understand: if I instantiate a new string--or rather, what looks to me like a new string even if the runtime really treats it as the same string--how can the runtime know that it already has that string in memory, without using a HashTable? The only way I would know of to determine if I have a cached hash code would be by looking up the key in a HashTable... which would require calculating the hash code... am I making sense?
Dan Tao
yes ! it does indeed require re calculating the hash; and this is why the only fundamental criterion about keys is the (im)mutability of keys.The important point is that once we have computed the hash signature of an object, it does not change ever again. so if we place it in a certain bucket, this placement remains valid forever.On the other hand, if your objects can mute (i.e. I'm holding a reference to an object, but its ''value'' changes over time, like with C strings) then the signature I have computed yesterday may not be valid tomorrow. And I will not be able to find my bucket again
Gyom
@Dan - It's a bit messy (I left out a few details), and you should probably go to the respective technical documentation for your platform/environment to verify this, but normally the runtime would maintain a balanced hash table of all strings in the system. And when you create a new string you actually get a reference to something in the string pool and the hash code has already been computed for that string (because the runtime did that).
John Leidegren
This depends a lot on language - C++ implementations never (AFAIK) maintain a pool of canonical strings. Java maintains one, but only uses it if you ask it to (string literals and when you call intern()). Scripting languages I don't know so well: interning strings slows down string creation but speeds up string comparison, so that's the trade-off. "how can the runtime know that it already has that string in memory, without using a HashTable" - it could for example use a Trie. Obviously however you store them, though, there's a lookup cost associated with canonicalisation.
Steve Jessop
"C++ implementations never (AFAIK) maintain a pool of canonical strings." - even when they're using copy-on-write, I mean.
Steve Jessop
+1  A: 

If you were to use a complex type as a key then:

  • It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
  • The hash table may need to have intimate knowledge of the type in order to pick a bucket.
  • There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.

Integers commonly used because they are easy to split into ranges that correspond to the buckets, they are value types and therefore immutable, and they are fairly easy to generate.

GraemeF
+3  A: 

It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hashing value never change) they are OK to index a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.

By the way, if a data type is simpe enough to be always passed by value (like scalars), then it will of course be OK.

But now imagine that you use a mutable type ; if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I will have no way to be notified ; and the object may now reside in the wrong bucket (if its hash value is different).

Hope this helps.

Gyom
This is a very helpful answer; but I'm still wondering if there are certain types that are "better" to use as keys than others. For instance, assuming I've defined a class which is actually immutable and will persist with the same hash code for its entire existence. Is it then perfectly fine to use as a key, or would it still be better to use something like an integer (for performance reasons)? It seems to me like the full, comprehensive answer is likely to be a combination of yours (keys must be immutable types) and spa's (types used as keys should have efficient hash functions)...
Dan Tao
@Dan: a particular hash table needs to store what it needs to store. If you're maintaining a web cache, you're storing content for URLs. The key has to be a URL, it can't be an integer, because you aren't looking up integers. Obviously faster is "better", but "does what I want slowly" is always "better" than "does something that's really fast but completely useless" :-)
Steve Jessop