views: 1215
answers: 8
When overriding the equals() method of java.lang.Object, the javadoc states:

"it is generally necessary to override the hashCode method whenever this method is overridden, so as to maintain the general contract for the hashCode method, which states that equal objects must have equal hash codes."

The hashCode method must return a unique integer for each object (this is easy to do when comparing objects based on memory location: simply return the object's unique integer address).

How should a hashCode() method be overridden so that it returns a unique integer for each object based only on that object's properties?


public class People {
    public String name;
    public int age;

    @Override
    public int hashCode() {
        // How to get a unique integer based on name and age?
        return 0; // placeholder so the class compiles
    }
}
/*******************************/
public class App {
    public static void main(String[] args) {
        People mike = new People();
        People melissa = new People();
        mike.name = "mike";
        mike.age = 23;
        melissa.name = "melissa";
        melissa.age = 24;
        System.out.println(mike.hashCode());    // output?
        System.out.println(melissa.hashCode()); // output?
    }
}
+20  A: 

It doesn't say the hashcode for an object has to be completely unique, only that two equal objects must return the same hashcode. It's entirely legal for two non-equal objects to return the same hashcode. However, the more evenly your hashcodes are distributed over a set of objects, the better performance you'll get out of HashMaps and other operations that use the hashCode.

IDEs such as IntelliJ IDEA have built-in generators for equals and hashCode that generally do a pretty good job of coming up with "good enough" code for most objects (and probably a better one than some hand-crafted, overly clever hash functions).

For example, here's a hashCode function that IDEA generates for your People class:

@Override
public int hashCode() {
    int result = name != null ? name.hashCode() : 0;
    result = 31 * result + age;
    return result;
}
Marc Novakowski
If the hashCode method for two equal objects has to return the same hashCode, doesn't that imply that all hashCodes must be unique? If objects a and b return the same hashCode, then they're equal; but they're not, so their hashCodes must be unique.
ForYourOwnGood
No, equality is determined by equals. hashCode must be the same for two equal objects. It's valid for two non-equal objects to have the same hash code.
Steve Kuo
Valid but undesirable. This is called a hash collision. A good hash algorithm minimises collisions, but doesn't need to guarantee they won't occur.
slim
Just to clarify: "equal objects => equal hashcode" is not the same as "equal hashcode => equal object". What is true though is "nonequal hashcode => nonequal objects". That means a hashcode method that always returns 42 for any object is, by definition, a valid hashcode; it's just a very lousy one.
Zach Scrivena
If you use an IDE to calculate your hash function, it's worth "sense-checking" what it does. Multiplying by 31 (or some other prime number) is generally OK, but is bad if you have a field (e.g. a bit field) likely to consist of powers of 2 (as this shifts zeroes into the low bits when you multiply!).
Neil Coffey
BTW, the Java hash tables (HashMap, etc.) re-hash incoming hash values to distribute them more evenly. So, while striving to make values more unique is useful to prevent collisions, it's usually not worth much effort to make them well-distributed.
Alex Miller
A: 

I think you misunderstood it. The hashcode does not have to be unique to each object (after all, it is a hash code) though you obviously don't want it to be identical for all objects. You do, however, need it to be identical to all objects that are equal, otherwise things like the standard collections would not work (e.g., you'd look up something in the hash set but would not find it).

For straightforward attributes, some IDEs have hashcode function builders.

If you don't use an IDE, consider using Apache Commons Lang and its HashCodeBuilder class.
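For example, a hashCode for the People class built with HashCodeBuilder might look like the following sketch (17 and 37 are just the conventional arbitrary non-zero odd seed numbers):

import org.apache.commons.lang.builder.HashCodeBuilder;

@Override
public int hashCode() {
    // combine the same fields that equals uses
    return new HashCodeBuilder(17, 37)
        .append(name)
        .append(age)
        .toHashCode();
}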

Uri
+5  A: 

I won't go into the details of hashCode uniqueness, as Marc has already addressed it. For your People class, you first need to decide what equality of a person means. Maybe equality is based solely on their name, maybe it's based on name and age. It will be domain specific. Let's say equality is based on name and age. Your overridden equals would look like:

@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (!getClass().equals(obj.getClass())) return false;
    People other = (People) obj;
    return (name == null ? other.name == null : name.equals(other.name))
        && age == other.age;
}

Any time you override equals you must override hashCode. Furthermore, hashCode can't use any more fields in its computation than equals did. Most of the time you add or exclusive-or the hash codes of the various fields (hashCode should be fast to compute). So a valid hashCode method might look like:

@Override
public int hashCode() {
    return (name == null ? 17 : name.hashCode()) ^ age;
}

Note that the following is not valid as it uses a field that equals didn't (height). In this case two "equals" objects could have a different hash code.

@Override
public int hashCode() {
    // height is a hypothetical extra field that equals does not use
    return (name == null ? 17 : name.hashCode()) ^ age ^ height;
}

Also, it's perfectly valid for two non-equals objects to have the same hash code:

@Override
public int hashCode() {
    return age;
}

In this case Jane, age 30, is not equal to Bob, age 30, yet both their hash codes are 30. While valid, this is undesirable for performance in hash-based collections.

Steve Kuo
+5  A: 

Another question asks if there are some basic low-level things that all programmers should know, and I think hash lookups are one of those. So here goes.

A hash table (note that I'm not using an actual classname) is basically an array of linked lists. To find something in the table, you first compute the hashcode of that something, then mod it by the size of the table. This is an index into the array, and you get a linked list at that index. You then traverse the list until you find your object.

Since array retrieval is O(1), and linked list traversal is O(n), you want a hash function that creates as random a distribution as possible, so that objects will be hashed to different lists. Every object could return the value 0 as its hashcode, and a hash table would still work, but it would essentially be a long linked-list at element 0 of the array.
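To make that concrete, here is a deliberately minimal sketch (the ToyHashSet class and its fixed 16-bucket table are invented for illustration, not any real library's implementation):

import java.util.LinkedList;

// A toy hash table: an array of linked-list "buckets".
class ToyHashSet {
    @SuppressWarnings("unchecked")
    private final LinkedList<Object>[] buckets = new LinkedList[16];

    private int indexFor(Object o) {
        // Strip the sign bit, then mod by the table size to pick a bucket.
        return (o.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void add(Object o) {
        int i = indexFor(o);                // O(1) bucket lookup
        if (buckets[i] == null) buckets[i] = new LinkedList<Object>();
        buckets[i].add(o);
    }

    public boolean contains(Object o) {
        LinkedList<Object> bucket = buckets[indexFor(o)];
        if (bucket == null) return false;
        for (Object candidate : bucket) {   // O(n) traversal within the bucket
            if (candidate.equals(o)) return true;
        }
        return false;
    }
}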

You also generally want the array to be large, which increases the chances that each object will end up in a list of length 1. The Java HashMap, for example, increases the size of the array when the number of entries in the map exceeds 75% of the size of the array. There's a tradeoff here: you can have a huge array with very few entries and waste memory, or a smaller array where each element is a list with more than one entry, and waste time traversing. A perfect hash would assign each object to a unique location in the array, with no wasted space.

The term "perfect hash" is a real term, and in some cases you can create a hash function that provides a unique number for each object. This is only possible when you know the set of all possible values. In the general case, you can't achieve this, and there will be some values that return the same hashcode. This is simple mathematics: if you have a string that's more than 4 bytes long, you can't create a unique 4-byte hashcode.

One interesting tidbit: hash arrays are generally sized based on prime numbers, to give the best chance for random allocation when you mod the results, regardless of how random the hashcodes really are.

Edit based on comments:

1) A linked list is not the only way to represent the objects that have the same hashcode, although that is the method used by the JDK 1.5 HashMap. Although less memory-efficient than a simple array, it does arguably create less churn when rehashing (because the entries can be unlinked from one bucket and relinked to another).

2) As of JDK 1.4, the HashMap class uses an array sized as a power of 2; prior to that it used 2^N+1, which I believe is prime for N <= 32. This does not speed up array indexing per se, but does allow the array index to be computed with a bitwise AND rather than a division, as noted by Neil Coffey. Personally, I'd question this as premature optimization, but given the list of authors on HashMap, I'll assume there is some real benefit.
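As a small standalone illustration of that indexing trick (a sketch, not the actual HashMap source):

public class IndexDemo {
    public static void main(String[] args) {
        int length = 16;                              // table size: a power of two
        int h = "some key".hashCode() & 0x7fffffff;   // a non-negative hash
        // With a power-of-two length, masking with (length - 1) keeps the
        // low bits: the same bucket as h % length, but without a division.
        System.out.println(h % length);
        System.out.println(h & (length - 1));         // prints the same index
    }
}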

kdgregory
There's no reason why the buckets have to be linked lists. They could also be, say, an array or a tree. With good hash codes (possibly rehashed) there is little advantage in prime-sized arrays. Power-of-two sized arrays are faster to index and it's easier to calculate the next size.
Tom Hawtin - tackline
(cont'd) There are also probing arrays (for instance used in the Sun version of ThreadLocal and IdentityHashMap). Instead of having buckets, after a collision other entries in the array are probed using some algorithm.
Tom Hawtin - tackline
True on both counts. There are many ways to implement the "bucket list", I felt that the linked list was easiest to understand. As for non-prime arrays, the key is "good hash codes," which are not necessarily the case for general-purpose programming environments.
kdgregory
Java's hash maps DON'T USE A PRIME NUMBER for the number of buckets: they use a power of two, allowing an AND operation instead of a division when calculating the bucket number. They compensate for the danger of hash codes with non-random low bits by mixing high bits of the hash code into the low ones.
Neil Coffey
+1  A: 

In general the hash code cannot be unique, as there are more values than possible hash codes (integers). A good hash code distributes the values well over the integers. A bad one could always give the same value and still be logically correct, it would just lead to unacceptably inefficient hash tables.

Equal values must have the same hash value for hash tables to work correctly. Otherwise you could add a key to a hash table, then try to look it up via an equal value with a different hash code and not find it. Or you could put an equal value with a different hash code and have two equal values at different places in the hash table.
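To make that lookup failure concrete, here is a minimal sketch (the Key class is invented for illustration): equals says two keys are equal, but hashCode is left identity-based, so the set probes the wrong bucket.

import java.util.HashSet;
import java.util.Set;

public class BrokenKeyDemo {
    static class Key {
        final String value;
        Key(String value) { this.value = value; }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Key && value.equals(((Key) obj).value);
        }
        // hashCode deliberately NOT overridden: it stays identity-based,
        // so two equal Keys almost always have different hash codes.
    }

    public static void main(String[] args) {
        Set<Key> set = new HashSet<Key>();
        set.add(new Key("jane"));
        // equals says these keys are equal, but their hash codes differ,
        // so the lookup probes the wrong bucket.
        System.out.println(set.contains(new Key("jane"))); // almost always false
    }
}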

In practice you usually select a subset of the fields to be taken into account in both the hashCode() and the equals() method.

starblue
+2  A: 

Joshua Bloch explains it best in chapter 3 of his "Effective Java".

duffymo
A: 

The only contractual obligation for hashCode is for it to be consistent. The fields used in creating the hashCode value must be the same or a subset of the fields used in the equals method. This means returning 0 for all values is valid, although not efficient.

One can check if hashCode is consistent via a unit test. I wrote an abstract class called EqualityTestCase, which does a handful of hashCode checks. One simply has to extend the test case and implement two or three factory methods. The test does a very crude job of testing whether the hashCode is efficient.
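A minimal JUnit-style sketch of such checks (not brianegge's actual EqualityTestCase, and assuming People overrides equals and hashCode over name and age as discussed above):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PeopleHashCodeTest {
    @Test
    public void equalObjectsHaveEqualHashCodes() {
        People a = new People();
        a.name = "mike"; a.age = 23;
        People b = new People();
        b.name = "mike"; b.age = 23;
        // The contract: if a.equals(b), their hash codes must match.
        assertEquals(a.hashCode(), b.hashCode());
    }

    @Test
    public void hashCodeIsConsistentAcrossCalls() {
        People p = new People();
        p.name = "melissa"; p.age = 24;
        // Repeated calls on an unmodified object must return the same value.
        assertEquals(p.hashCode(), p.hashCode());
    }
}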

brianegge
A: 

This is what the javadoc tells us about the hashCode method:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

M. Atif Riaz