views:

140

answers:

4

I have check out this Wikipedia page on it, but I still don't understand it. Can someone please help my dim-witted mind to understand the concepts of hashing, hashtable/hashmap, and hash functions? Some examples would really help.

+7  A: 

The Wikipedia article will have a lot of technical information, but a simplistic view of hashing is something like the following.

Imagine that there's a magical function that can give a number to any object. Given the same object, it always return the same number.

Immediately now you have a quick way to test if two objects are the same: ask this function for their numbers and compare. If they're different, then they're not the same.

But what if they have the same number? Could two different objects have the same number?

Yes, this is possible in most scenario. Let's say that the function can only give numbers between 1..10, for example, and there are 100 different objects. Then of course some different objects must have the same number. This is what is called a "collision". A "collision" makes our quick equality test not as useful, so as much as possible we want to minimize its happening. A good magical function is one that would try to minimize the number of "collision".

So what else can you do with this number? Well, you can use it to index an array. Given an object, you can put it at the index given by the number from this magical function. This array is essentially what a hashtable is; this magical function is a hash function.

polygenelubricants
+1  A: 

This book (and supporting video lectures) provide some great explanation of algorithms and data structures. There are some lectures about hash functions (1, 2). I'd recommend that.

Cormen

Also, just FYI, hashCode(), called on an instance of Object class returns an address of this particular instance in memory. Not really true, as pointed by polygenelubricants in comments.

folone
FYI, your FYI is a half-truth. From http://java.sun.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29 - "`converting` the internal address of the object into an integer/this implementation technique is `not required`"
polygenelubricants
I'm intrigued. Could you please be more specific? :)
folone
But isn't this a recommendation for classes, that override this method? Instead I'm talking about instances of the `Object` class itself.
folone
@folone: Read the whole sentence. That quote is directly about `Object.hashCode`. So two things are: (i) it's not the address, it's a conversion from the address (ii) that's only typical, not mandatory implementation. Keep in mind that objects can be moved around in the address space too, what with garbage collection and virtual machine etc, so there's definitely a few layers of abstraction there.
polygenelubricants
okay, I'll remove that.
folone
A: 

A hashtable is basically a way to store anything in an array and retrieve it almost as fast as looking up something in an array via an index, without wasting too much space.

The job of a hash function is (in this context) to compute the array index at which an object will be stored, based on the contents of the object. That means, it must always return the same result for the same object, and should return different results for different objects as much as possible. When two different objects have the same hash, it's called a "collision", and you have to treat those cases specially, which makes the whole thing slower.

Michael Borgwardt
+1  A: 

A hash function is a way to create a compact representation of an arbitrarily large amount of data. In java with the hashcode method this means somehow describing the state of your object (no matter how large) in an int (4 bytes). And is usually written to be a fairly fast as explained below.

To simplify in hashtables/hashmaps the hashcode serves as sort of a cheap equals. Take two objects a and b of type Foo lets says to figure out if a.equals(b) takes 500 ms where as calculating a (efficient) hashcode only take 10ms. So if we want to know if a.equals(b) instead of doing that directly first we will look at the hashcodes and ask does a.hashCode() == b.hashCode(). Note that this will take only 20ms in our example.

Because of the API definition of hashcode we know that if the hashcode of a is not equal to b then a.equals(b) should never be true. So in our above test if we see the hashcodes are unequal then we never need to do the longer .equals() test, this is why you should always override hashCode and equals together.

You may also see references about writing "good" or "well distributed" hashcodes. This has to do with the fact that the inverse of the previous statements about hashcode and equals is not true. More specifically a.hashCode() == b.hashCode() does not necessarily imply a.equals(b) So the idea of a good hashcode is you reduce the likelyhood of a.hashCode() == b.hashCode() when a.equals(b) is false. You may have seen this referred to as a collision of a hash function.

Back to hashmaps/tables. These are based on key/value pairs. So when you add or retrieve a value you will supply a key. So the first thing the map has to do is look for the key, which means finding something that .equals() the key you provide. But as we discussed above .equals() can be incredibly slow which means comparisons can be greatly sped up by checking hashcodes first. Since when the hashcodes are well distributed you should know quickly when x is definitely != y.

Now in addition to the comparison hashmaps/tables actually use the hashcodes to organize their internal storage of the data, however I think that is beyond the scope of what you are looking to understand at this point.

M. Jessup