views:

6735

answers:

7

In Java, the hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?

+2  A: 

I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.

Dave L.
+13  A: 

On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)

Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.

It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).

Tom Hawtin - tackline
Funny enough, the multiplication with 31 is on my desktop machine actually a little bit slower than multiplication with, say, 92821. I guess the compiler tries to "optimize" it into shift and add as well. :-)
hstoerr
+29  A: 

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)

matt b
Well all primes are odd, except 2. Just sayin.
Kip
I don't think Bloch is saying it was chosen because it was an odd prime, but because it was odd AND because it was prime (AND because it can easily be optimized into a shift/subtract).
matt b
Yes brilliant book!
Mark
31 was chosen coz it is an odd prime??? That doesnt make any sense - I say 31 was chosen because it gave the best distribution - check http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
I think the choice of 31 is rather unfortunate. Sure, it might save a few CPU cycles on old machines, but you have hash collisions already on short ascii strings like "@ and #! , or Ca and DB . This does not happen if you choose, for instance, 1327144003, or at least 524287 which also allows bitshift: 524287 * i == i << 19 - i.
hstoerr
But why multiply it just pushes the low order bits from the item to the left, after enough items all your bits are pushed left and overflow. perhaps if they underflowed then things might be better.
mP
@hstoerr: I'd like to see your math on that. Even if you're right (which you most likely are for the 2 digit example), I think if you look at how hashes are used in Java, it doesn't really hurt things very much for there to be collisions in very short strings. They're not used for keys all that often.
Jason
@Jason See my answer http://stackoverflow.com/questions/1835976/what-is-a-sensible-prime-for-hashcode-calculation/2816747#2816747 . My point is: you get much less collisions if you use a larger prime, and lose nothing these days. The problem is worse if you use non-english languages with common non-ascii chars. And 31 served as a bad example for many programmers when writing their own hashCode functions.
hstoerr
+15  A: 

As Goodrich and Tamassia point out, If you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31,33,37,39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.

Ironically, I was in the middle of reading the section "polynomial hash codes" when I saw this question.

jJack
That's coincidence, not irony :)
Grandpa
Note however that you might get WAY more collisions if you use any kind of international charset with common characters outside the ASCII range. At least, I checked this for 31 and German. So I think the choice of 31 is broken.
hstoerr
+2  A: 

By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.

By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.

The expression n * 31 is equivalent to (n << 5) - n.

erickson
+1  A: 

Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

The numbers that make up using a hash are typically:

  • modulus of the data type you put it into (2^32 or 2^64)
  • modulus of the bucket count in your hashtable (varies. In java used to be prime, now 2^n)
  • multiply or shift by a magic number in your mixing function
  • The input value

You really only get to control a couple of these values, so a little extra care is due.

Jason
A: 

Minor comment, does 31 work best for distribution because there are 26 letters in the English alphabet. What happens if your alphabet has a lot more symbols or glyphs would there be a need to have a higher number replacing 31 ? My guess is one picked a number less than 31 then because there are 26 letters there would be a lot more clashes.

I always wondered why 63 was not chosen given there are 26 letters, 10 digits, common punctuation etc. I guess the best answer to that is the example above where 31 was replaced with other values and clashes rechecked. I wonder if the word sample was mostly letters only, or a true mixture of some sample text from a book etc. My guess is the style for most String keys that go inside a Map end up being letters only, so using 31 because its greater tahn 26 makes sense.

mP