I have a range of objects that have a long field whose value uniquely identifies a particular object across my entire system, much like a GUID. I have overridden Object.equals() to use this id for comparison, because I want it to work with copies of the object. Now I want to override Object.hashCode(), too, which basically means mapping my long to some int return value.

If I understood the purpose of hashCode correctly, it is mainly used in hash tables, so a uniform distribution would be desirable. This would mean that simply returning id % 2^32 would suffice. Is that all, or should I be aware of something else?
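For illustration, here is a minimal sketch of what I mean; in two's complement, casting the long to an int keeps the low 32 bits, which is the same as taking the value mod 2^32:

@Override
public int hashCode() {
    return (int) id;   // keeps the low 32 bits, i.e. id mod 2^32
}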

+3  A: 

You have understood the purpose of hashCode correctly. Yes, a uniform distribution is desirable (although not an actual requirement).

I would suggest (int) ((id >> 32) ^ id).

The above expression:

  • Uses all bits of the original value and does not discard any information upfront. For example, depending on how you are generating the IDs, the upper bits could change more frequently (or the opposite).
  • Does not introduce any bias towards values with more ones (zeros), as would be the case if the two halves were combined with an OR (AND) operation.
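Putting it together, a minimal hashCode override using this expression could look like this (a sketch, assuming the field is called id as in the question):

@Override
public int hashCode() {
    // fold the upper 32 bits into the lower 32 bits, then truncate to int
    return (int) ((id >> 32) ^ id);
}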
Grodriguez
You might want to provide an explanation.
mikerobi
I thought it was self-explanatory, but I'll give it a try. Updating the answer.
Grodriguez
+1. This is almost the hashCode defined for `java.lang.Long`, although it uses `>>>` instead of `>>`. I wonder whether `(new Long(id)).hashCode();`, or similar, would be optimized well.
Steve Jessop
@Lucero: If they were identities, then certainly this would not add much to the original solution; however, the OP didn't state in his question that the original values were identities. This solution is more general.
Grodriguez
@Steve: There is no difference between `>>>` and `>>` in this case, since the extra 32 bits that are introduced during the shift are discarded by the cast to int anyway.
Grodriguez
A: 
int result = (int)((longVal >> 32) ^ longVal);

will be better distributed, because the modulo approach will not return a different value if only the upper bits of your long value have changed.
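A quick illustration (an assumed example): two IDs that differ only in their upper 32 bits collide under plain truncation/modulo, but not under shift-and-XOR:

long a = 0x0000000100000005L;
long b = 0x0000000200000005L;

// truncation (id mod 2^32) ignores the upper bits entirely:
System.out.println((int) a == (int) b);                             // true -- collision

// shift-and-XOR mixes the upper bits into the result:
System.out.println((int) ((a >> 32) ^ a) == (int) ((b >> 32) ^ b)); // false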

codymanix
Why are you only keeping the lower 16 bits of `longVal` in the second argument to the XOR? (By the way, you don't need a mask anyway. You do need a cast, though.)
Grodriguez
yes you are right, I changed that
codymanix
+6  A: 

how about

Long.valueOf(guid).hashCode();

You could return it, or do it when you create the guid and cache it in a variable.
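A sketch of the caching variant (the class name here is made up for illustration):

private final long guid;
private final int hash;

public MyObject(long guid) {
    this.guid = guid;
    this.hash = Long.valueOf(guid).hashCode();   // computed once, up front
}

@Override
public int hashCode() {
    return hash;
}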

Looking at the docs this does:

(int)(this.longValue()^(this.longValue()>>>32))

This is a decent solution since it makes use of the Java library; it's always better to leverage something that has already been tested.

TofuBeer
This can be expensive as it requires object creation (hence the Guava alternative). As for the algorithm itself, the only time it's dangerous is when the upper and lower 32 bits have correlated meanings. For example, this would be a horrible hashcode for a `Point` class that stores a 32-bit x and y coord in a single long.
Mark Peters
It is entirely possible for the VM to optimize out the object creation. Not that I would want to rely on that.
TofuBeer
@Mark Actually it would work just fine for a Point class as well. In fact, if I had a point class with separate x,y ints, I would generate the hashCode in a similar fashion (x^y).
james
@james: That's an awful design decision though, because something like a point typically doesn't have a uniform distribution over all its bitwidth. For example, if you know that all points will be between 0 and 1023 in the x and y axis, you will get something like a 99.8% collision rate on that hash. There are 1024*1024 possible combinations being reduced into 10 bits of hash.
Mark Peters
@TofuBeer: Actually I'd guess it's **impossible** for the VM to optimize that out (though I wonder if a compiler could legitimately optimize it out). Do you have a source/JLS/JVM spec link? I'd be really interested on this one. Java developers (myself included typically) frequently delegate optimization concerns to the VM when in fact the spec prevents those optimizations from even taking place. Plus we often rely on optimizations that might be possible but don't happen in practice.
Mark Peters
http://download.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html not exactly the same thing, but in the same area. I didn't say that any VMs actually do it, just that it should be possible. There is nothing I am aware of in any spec that would say that the VM could not optimize out the object creation and inline the code. Generally compilers (javac type) are stupid when it comes to optimizing.
TofuBeer
@Mark: a recent Sun/Oracle JVM **will** optimize this if used with the `-XX:+DoEscapeAnalysis` flag. The JVM will be able to notice that the `Long` instance does not need to live outside of the statement, and thus can be created on the stack. Moreover, it will be able to inline the code (it will notice that we talk about `Long` and not another class which extends `Long`). Stack-based creation and inlining allow for the full host of further optimizations that the JVM knows about.
Thomas Pornin
@TofuBeer and @Thomas: good reads, thanks a bunch! It seems like DoEscapeAnalysis will go a long way, since it would negate the need for allocation on the heap (though it is still "creating" an object on the stack). I'd imagine this would get within the same order of magnitude as dealing with primitives directly.
Mark Peters
@Mark not sure i'm following the calculation you used to determine the "99.8%" collision rate.
james
@james Not sure about the 99.8% stat (that depends on how many insertions you do), but look at this as a 'birthday paradox' problem (http://en.wikipedia.org/wiki/Birthday_problem). With only 1024 possible values in Mark's example, there's 50% chance of a collision after just 38 hashes of randomly-chosen values, 99% after just 98. This is why hashing is so difficult: for performance-sensitive hashing, you need to either understand your data + hash accordingly, or in the (even harder) case of general-purpose libraries, anticipate likely usages + mix the bits well.
Cowan
@Cowan and @james: I posted an answer to address this: http://stackoverflow.com/questions/4045063/how-should-i-map-long-to-int-in-hashcode/4054454#4054454. It's too difficult to do in a comment. Cowan is dead on though, and his numbers agree with mine that 38 items results in a 50% chance of collision. Regarding the 99.8% stat I used the wrong name. Collision rate was a bad description, I meant rather that 99.8% of the range of a perfect hash (1024*1024-1024) goes unused.
Mark Peters
Real world scenario: say you were using the Point class to implement a sparse matrix (1024, 1024) based on a `HashMap<Point, T>`. If you used this hashcode, and the matrix was only sparsely populated (say 50000 elements out of 1024^2), every element you add would have a 97.9% chance of colliding with another element in the matrix.
Mark Peters
+3  A: 

It's a bit of a minor thing if you're not using Guava already, but Guava can do this for you nicely:

public int hashCode() {
  return Longs.hashCode(id);
}

That gives you the equivalent of Long.valueOf(id).hashCode():

return (int) (value ^ (value >>> 32));

Additionally, if you were to have other values or objects that were part of the hashcode, you could just write

return Objects.hashCode(longValue, somethingElse, ...);

The long would be autoboxed into a Long so you'd get the correct hashcode for it as part of the overall hashcode.
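For example (assuming a hypothetical String name field alongside the long id, with com.google.common.base.Objects imported):

@Override
public int hashCode() {
    // id is autoboxed to Long; Guava combines the individual hashcodes
    return Objects.hashCode(id, name);
}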

ColinD
I wouldn't pull in a whole new library just for this, but I had never heard of Guava; it seems quite helpful and worth looking at from a more general point of view. Thanks!
Hanno Fietz
@Hanno: Yeah, it certainly wouldn't be worth it just for this little thing. But it's a great library with tons of useful features!
ColinD
A: 

(l >> 32) ^ l is a good hashcode in most cases, particularly when the long has a uniform distribution.

Since it was the accepted answer, I'm posting this to clarify some of my comments about when it's NOT a good hashcode for a long.

The example I gave was a Point class like this:

public class Point {
    private final long coords; // x in the high 32 bits, y in the low 32 bits

    public int getX() {
        return (int) (coords >> 32); // upper half
    }

    public int getY() {
        return (int) coords;         // lower half
    }

    @Override
    public int hashCode() {
        // the shift-and-XOR hash under discussion
        return (int) ((coords >> 32) ^ coords);
    }
}

It may seem contrived, but occasionally you have multiple "fields" packed into a long.
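For illustration, a hypothetical constructor for this Point (not part of the original example) that packs x and y; the mask on y prevents sign extension from overwriting the high bits:

public Point(int x, int y) {
    this.coords = ((long) x << 32) | (y & 0xFFFFFFFFL);
}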

So the coords field represents 32 bits of x and 32 bits of y. So why is this a problem? Well, it's not if each of x and y are evenly distributed over their respective 32 bits. But that's unlikely in practice. What is more likely is that X and Y are bounded by some number. Let's say 1024 since it's 2^10. This means that at most the lower 10 bits of each X and Y are set:

00000000 00000000 000000XX XXXXXXXX 00000000 00000000 000000YY YYYYYYYY

There are 2^20 (1024*1024) possible combinations. But what is the hashCode operation doing?

  00000000 00000000 000000XX XXXXXXXX 
^ 00000000 00000000 000000YY YYYYYYYY
-------------------------------------
= 00000000 00000000 000000?? ????????

There are at most 2^10 (1024) possible hashCode values since only the lower 10 bits can ever be anything other than zero. The ratio of hash values to real values is 1024:(1024*1024) or 1:1024. So right off the bat there is a 1/1024 probability that two numbers have the same hash.

Now let's calculate the probability of a collision by applying math from the birthday problem. Let p(n) be the probability that with n values there will be at least one collision. We know that p(1025+) = 1 since there are only 1024 values.

p(n) = 1 - (n! * (1024 choose n))/1024^n

This works out to the following:

n: p(n)
1: 0.00000
2: 0.00098
3: 0.00293
4: 0.00585
5: 0.00973
6: 0.01457
...
38: 0.50096
...
79: 0.95444
...
148: 0.99999
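These values are easy to reproduce with a small sketch that computes p(n) iteratively (each new item must miss all the hash values already in use):

int buckets = 1024;            // 2^10 possible hash values
double pNoCollision = 1.0;
for (int n = 1; n <= 148; n++) {
    // the n-th item must avoid the n-1 hash values already taken
    pNoCollision *= (double) (buckets - (n - 1)) / buckets;
    System.out.printf("%d: %.5f%n", n, 1.0 - pNoCollision);
}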

With just 38 items, there is probably a collision. With 148 items, there is a 99.999% chance of (at least one) collision. With 148 items, each item has a 7% chance of colliding with another item. With a proper hashing function that takes knowledge of the domain into account, these numbers could easily go down to 0.

In other words, knowing your domain and how things happen in practice is key to making a performant hash. Library functions try to do as good a job as possible while knowing nothing about your domain, and to be performant they typically rely on a distribution of data that won't occur in practice.
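For example, a sketch of a domain-aware hash for the Point above (relying on the assumed 0..1023 bound), which is collision-free:

@Override
public int hashCode() {
    // with x and y each in [0, 1024), every (x, y) pair maps to a
    // distinct 20-bit value -- a perfect hash for this domain
    return (getX() << 10) | getY();
}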

Mark Peters
Ultimately, this answer is orthogonal to my original statement that using x^y for a Point class is a reasonable hash. Your argument here is that it is not reasonable *if* x and y are limited to max 1024. Valid point, but it does not contradict my original statement.
james
@james: That's just being unnecessarily ignorant though, is my point. How often in practice is a set of Points uniformly distributed over its domain? Almost never. There's a reason Bloch suggests this type of recipe for hashCode: `somePrime * getX() + getY()`. It's not great, but the prime is there to try to "uncorrelate" data without knowing anything about the domain. That's also how the real `Point2D` class works, in general.
Mark Peters
@james: by the way, this is just as relevant for x and y being bounded by 2^30 as well, though for 2^30 you'd expect a ton of collisions anyway; there's nothing you can do about that. 1024 was chosen simply because it's easy to explain.
Mark Peters