views: 431
answers: 3

I am reading the code of the HashMap class provided by the Java 1.6 API and am unable to fully understand the need for the following operation (found in the bodies of the put and get methods):

int hash = hash(key.hashCode());

where the method hash() has the following body:

    private static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

This effectively recalculates the hash by performing bit operations on the supplied hash code. I'm unable to understand the need for this, even though the API documentation states the following:

This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.

I do understand that the key-value pairs are stored in an array of data structures, and that the index of an item in this array is determined by its hash. What I fail to understand is how this function adds any value to the hash distribution.
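For context, here is a quick sketch (my own test class, not part of the JDK) that just prints a sample hash code before and after the transformation, so you can see the bit mixing I'm asking about:

    public class HashBitsDemo {
        // Supplemental hash copied from the java 1.6 HashMap source above.
        private static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        public static void main(String[] args) {
            int h = "example".hashCode();
            System.out.println("before: " + Integer.toBinaryString(h));
            System.out.println("after:  " + Integer.toBinaryString(hash(h)));
        }
    }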

Any help would be appreciated!

+1  A: 

I read somewhere that this is done to ensure a good distribution even if your hashCode implementation, well, err, sucks.

Helper Method
Right, and the default hashCode() implementation in java.lang.Object doesn't have much distribution between hashes.
Sam Barnum
This is true, however more explanation/citation/link would be nice...
pajton
What I don't understand is this: if each hash is already unique (and the method in question does not - and cannot - address uniqueness anyway), what problem does this mechanism actually face? It mentions something about collisions in the lower-order bits, but that's not very clear.
Varun Garde
pgras
+3  A: 

As Helper wrote, it is there just in case the existing hash function for the key objects is faulty and does not do a good-enough job of mixing the lower bits. According to the source quoted by pgras,

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }

The hash is being ANDed with length-1, where length is a power of two (therefore, length-1 is guaranteed to be a sequence of 1s in its low bits). Because of this ANDing, only the lower bits of h are used; the rest of h is ignored. Imagine that, for whatever reason, the original hash only returned even numbers. If you used it directly, the odd-numbered positions of the hashmap would never be used, roughly doubling the number of collisions. In a truly pathological case, a bad hash function can make a hashmap behave more like a list than like an O(1) container.
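To make that concrete, here is a throwaway sketch (my own demo class, nothing from the JDK beyond the masking idiom) of what happens when every hash code is even and the table length is a power of two:

    public class EvenHashDemo {
        // Same masking trick HashMap uses; only valid when length is a power of two.
        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int length = 16; // power-of-two table length
            // Pretend our hashCode() implementation only ever returns even numbers.
            for (int h = 0; h < 20; h += 2) {
                System.out.println("hash " + h + " -> bucket " + indexFor(h, length));
            }
            // Every printed bucket index is even; the odd buckets are never used,
            // so half the table is wasted and collisions roughly double.
        }
    }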

Sun engineers must have run tests that show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits. Under these circumstances, the bit operations in HashMap's hash(int h) can provide a net improvement over most expected use-cases (due to lower collision rates), even though extra computation is required.
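As a rough illustration (my own sketch reusing the 1.6 hash(int) from the question, not a benchmark), compare bucket indices with and without the supplemental hash for keys whose hash codes differ only in their upper bits:

    public class SpreadDemo {
        // Supplemental hash copied from the java 1.6 HashMap source in the question.
        static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int length = 16;
            for (int i = 0; i < 8; i++) {
                int raw = i << 16; // hash codes that differ only above the low 16 bits
                System.out.println("raw=" + raw
                        + "  rawBucket=" + indexFor(raw, length)           // always 0 without mixing
                        + "  mixedBucket=" + indexFor(hash(raw), length)); // spread across buckets
            }
        }
    }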

tucuxi
+1 Wow, really good answer, much much better than mine -,-
Helper Method
"just in case"? Actually, most hash codes in Java are going to be crappy. Just look at java.lang.Integer, for instance!But this actually makes sense. It's better to say "it's okay if everyone's Object.hashCode()s have crappy bit distribution, as long as they follow the equal-objects-have-equal-hashcodes rule, and try to avoid collisions as much as possible." Then only collection implementations like HashMap have the burden of passing those values through a secondary hash function, instead of it being everyone's problem.
Kevin Bourrillion
A: 

Are you sure this is still the case? I looked at the code in the link given above and the only operations I see that look like this are in the calculate capacity function.

http://www.dreamincode.net/forums/topic/189119-javautil%3Bhashmapcalculatecapacityin-x/

John Creighton