ansaurus

Question

Answer 1

+3 A:

Here's a great explanation: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

duduamar 2010-08-31 20:50:22

Can you summarize the content of that post? If it were to go away at some point, this answer becomes useless.

Thomas Owens 2010-08-31 20:52:03

I'd say the explanation is extremely weak. It makes a number of dubious statements, such as implying the hash codes are unique and that, "Given a string 'Samuel', you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up."

erickson 2010-08-31 20:59:27

It's a nice read, but it falls short of actually explaining why a prime such as 31 is better than a non-prime such as 42. Then again, run the test: the difference between a prime and just any odd number is often too small to measure.

Pascal Cuoq 2010-08-31 21:02:43

Answer 2

A:

I heard that 31 was chosen so that the compiler can optimize the multiplication to left-shift 5 bits then subtract the value.

Steve Kuo 2010-08-31 21:11:52

How could the compiler optimize it that way? x*31 == x*32 - 1 isn't true for all x, after all. What you meant was left shift by 5 (which equals multiplying by 32) and then subtract the original value (x in my example). While this might be faster than a multiplication (it probably isn't on modern CPUs, by the way), there are more important factors to consider when choosing a multiplier for a hash code (even distribution of input values across buckets comes to mind).
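A minimal sketch of the identity under discussion, assuming Java's wrapping 32-bit int arithmetic. The class and method names are made up for illustration.

```java
// Hypothetical demo class; the names are illustrative only.
final class ShiftIdentity {
    // Multiply by 31 using a shift and a subtraction: 32*x - x.
    static int times31ViaShift(int x) {
        return (x << 5) - x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            // Both sides wrap modulo 2^32, so they agree even when the product overflows.
            System.out.println(x + ": " + (31 * x == times31ViaShift(x)));
        }
        // Subtracting 1 instead of x gives a different function.
        System.out.println((7 << 5) - 1); // 223, whereas 31 * 7 is 217
    }
}
```

Whether the shift form is actually faster than a plain multiply depends on the CPU and the JIT; the sketch only shows that the two forms compute the same value.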

Grizzly 2010-08-31 22:52:07

Do a bit of searching, this is a pretty common opinion.

Steve Kuo 2010-08-31 22:54:28

Answer 3

A:

It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.

fennec 2010-08-31 21:13:47

Answer 4

A:

Here's a citation a little closer to the source.

It boils down to:

31 is prime, which reduces collisions
31 produces a good distribution, with a reasonable tradeoff in speed

John at CashCommons 2010-08-31 21:15:55

Answer 5

+8 A:

Because you want the number you are multiplying by and the number of buckets you are inserting into to have no prime factors in common (that is, to be coprime).

Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets to a power of 2).
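The 8-bucket case can be sketched as follows. This assumes a String.hashCode-style polynomial hash with a configurable multiplier and a plain modulo to pick the bucket; the helper names are hypothetical, and real HashMap implementations also mix the bits before indexing.

```java
// Hypothetical demo class; not taken from any library.
final class MultiplierVsBuckets {
    // Polynomial hash in the style of String.hashCode, with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Bucket index by plain modulo (floorMod keeps the result non-negative).
    static int bucket(String s, int multiplier, int buckets) {
        return Math.floorMod(hash(s, multiplier), buckets);
    }

    public static void main(String[] args) {
        String[] keys = {"aa", "ba", "ca", "da"}; // differ only in the first character
        for (String k : keys) {
            System.out.println(k + " -> multiplier 8: bucket " + bucket(k, 8, 8)
                    + ", multiplier 31: bucket " + bucket(k, 31, 8));
        }
    }
}
```

With multiplier 8 every key lands in bucket 1, because only the last character survives the modulo. With multiplier 31 the four keys land in four different buckets.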

ILMTitan 2010-08-31 21:30:10

What if there are 31 or 62 buckets (or some other multiple of 31) then?

Steve Kuo 2010-08-31 21:37:58

Then a hash function that multiplies by 31 will perform non-optimally. However, I would consider such a hash table implementation poorly designed, given how common 31 is as a multiplier.

ILMTitan 2010-08-31 21:42:48

So 31 is chosen based on the assumption that hash table implementors know that 31 is commonly used in hash codes?

Steve Kuo 2010-08-31 21:50:08

31 is chosen based on the idea that in most implementations the bucket count factors into relatively small primes, usually 2s, 3s and 5s. It may start at 10 and grow 3X when it gets too full. The size is rarely entirely random. And even if it were, a 30/31 chance that the bucket count is not a multiple of 31 is not bad odds. It may also be easy to calculate, as others have stated.

ILMTitan 2010-08-31 21:55:28

Answer 6

+7 A:

Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.

This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:

Input       Modulo 8    Modulo 7
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0

Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.

Although the above example is largely contrived, the general principle is that when the inputs follow a pattern, a prime modulus tends to yield a better distribution.
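The table can be reproduced with a short sketch. The class and method names here are invented for illustration; it simply collects which buckets the inputs 0, 4, 8, ..., 28 fall into under each modulus.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; the names are illustrative only.
final class ModulusSpread {
    // Collect the buckets hit by the inputs 0, step, 2*step, ... (count inputs) under a modulus.
    static Set<Integer> bucketsUsed(int step, int count, int modulus) {
        Set<Integer> used = new TreeSet<>();
        for (int i = 0; i < count; i++) {
            used.add((i * step) % modulus);
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("mod 8: " + bucketsUsed(4, 8, 8)); // [0, 4]
        System.out.println("mod 7: " + bucketsUsed(4, 8, 7)); // [0, 1, 2, 3, 4, 5, 6]
    }
}
```

The step of 4 shares a factor with 8, so only two of the eight buckets are ever used. It shares no factor with 7, so all seven buckets are used.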

advait 2010-08-31 21:38:27

Aren't we talking about the multiplier used to generate the hash code, not the modulo used to sort those hash codes into buckets?

ILMTitan 2010-08-31 21:50:28

Answer 7

+2 A:

For what it's worth, Effective Java 2nd Edition hand-waves around the mathematics issue and just says that the reason to choose 31 is:

Because it's an odd prime, and it's "traditional" to use primes
It's also one less than a power of two, which permits bitwise optimization

Here's the full quote, from Item 9: Always override hashCode when you override equals:

The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.

A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
 31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.

While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.

Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.

Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
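The quoted point about even multipliers losing information on overflow can be checked with a small sketch. It assumes a String.hashCode-style polynomial hash with 32-bit int arithmetic; the class name, helper, and sample strings are made up for illustration.

```java
// Hypothetical demo class; not taken from the book or any library.
final class EvenMultiplierLoss {
    // Polynomial hash in the style of String.hashCode, with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "Axxxxxxx"; // 8 characters; a and b differ only in the first one
        String b = "Bxxxxxxx";
        // With multiplier 32 the first character is scaled by 32^7 = 2^35,
        // which is 0 modulo 2^32, so it is shifted out of the hash entirely.
        System.out.println(hash(a, 32) == hash(b, 32)); // true
        // 31^7 is odd, so the first character still affects the result.
        System.out.println(hash(a, 31) == hash(b, 31)); // false
    }
}
```

Any odd multiplier avoids this particular loss, prime or not, which is consistent with the book's remark that the advantage of a prime specifically is less clear.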


Why use a prime number in hashCode?
