tags:

views:

392

answers:

5

Does anyone know how Java implements its hash tables (HashSet or HashMap)? Given the various types of objects that one may want to put in a hash table, it seems very difficult to come up with a hash function that would work well for all cases.

+6  A: 

Java depends on each class' implementation of the hashCode() method to distribute the objects evenly. Obviously, a bad hashCode() method will result in performance problems for large hash tables. If a class does not provide a hashCode() method, the default in the current implementation is to return some function (i.e. a hash) of the the object's address in memory. Quoting from the API doc:

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)

Jim Garrison
hashcode() can return totally different ranges of values (eg 1 to 100 vs 1 to 100,000). So how does Java know how many buckets it needs to create for a hash table?
escalon
@escalon: hashcode() returns an int (which always has range -2^31 to 2^31-1). You modulo the hash code by the current number of buckets for the table. This is how these hash tables usually work. The number of buckets can change as the hash table grows.
newacct
escalon, your original question was how java implements hashcode, while your comment here is more on the theory of hash table design. You may want to check out the wikipedia articles on hashing.
Jherico
If you try `Object.hashCode` you should see that it clearly does not return the memory address. (On reasonable implementations, the value is stored in the object header. Sun's implementation initialises the value on first use, using a slight rehash of the memory address at the time of initialisation.)
Tom Hawtin - tackline
@tackline You are of course correct. I've edited the answer.
Jim Garrison
+6  A: 

You can check the source of HashMap, for example.

JG
+1  A: 

There are two general ways to implement a HashMap. The difference is how one deals with collisions.

The first method, which is the one Java users, makes every bucket in a the HashMap contain a singly linked list. To accomplish this, each bucket contains an Entry type, which caches the hashCode, has a pointer to the key, pointer to the value, and a pointer to the next entry. When a collision occurs in Java, another entry is added to the list.

The other method for handling collisions, is to simply put the item into the next empty bucket. The advantage of this method is it requires less space, however, it complicates removals, as if the bucket following the removed item is not empty, one has to check to see if that item is in the right or wrong bucket, and shift the item if it originally has collided with the item being removed.

brianegge
`IdentityHashMap` and the hash map used in `ThreadLocal` use the probing algorithm (in Sun's JRE).
Tom Hawtin - tackline
+1  A: 

HashMap and HashSet are very similar. In fact, the second contains an instance of the first.

A HashMap contains an array of buckets in order to contain its entries. Array size is always powers of 2. If you don't specify another value, initially there are 16 buckets.

When you put an entry (key and value) in it, it decides the bucket where the entry will be inserted calculating it from its key's hashcode (hashcode is not its memory address, and the the hash is not a modulus). Different entries can collide in the same bucket, so they'll be put in a list.

Entries will be inserted until they reach the load factor. This factor is 0.75 by default, and is not recommended to change it if you are not very sure of what you're doing. 0.75 as load factor means that a HashMap of 16 buckets can only contain 12 entries (16*0.75). Then, an array of buckets will be created, doubling the size of the previous. All entries will be put again in the new array. This process is known as rehashing, and can be expensive.

Therefore, a best practice, if you know how many entries will be inserted, is to construct a HashMap specifying its final size:

new HashMap(finalSize);
Sinuhe
Technically depends upon implementation. It used to be popular to attempt a prime size of table. (Also, IIRC, the constructor to `HashMap` takes the capacity, so you should divide by the load factor and round up, but I think specifying the capacity is almost always more effort than it is worth.)
Tom Hawtin - tackline
A: 

In my own words:

An Entry object is created to hold the reference of the Key and Value.

The HashMap has an array of Entry's.

The index for the given entry is the hash returned by key.hashCode()

If there is a collision ( two keys gave the same index ) , the entry is stored in the .next attribute of the existing entry.

That's how two objects with the same hash could be stored into the collection.

From this answer we get:

   public V get(Object key) {
       if (key == null)
           return getForNullKey();
       int hash = hash(key.hashCode());
       for (Entry<K,V> e = table[indexFor(hash, table.length)];
            e != null;
            e = e.next) {
           Object k;
           if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
               return e.value;
       }
       return null;
   }

Let me know if I got something wrong.

OscarRyz