I am wondering about the parameters for constructing a ConcurrentHashMap:

  • initialCapacity is 16 by default (understood).
  • loadFactor is 0.75 by default.
  • concurrencyLevel is 16 by default.

My questions are:

  • What criteria should be used to adjust loadFactor up or down?
  • How do we establish the number of concurrently updating threads?
  • What criteria should be used to adjust concurrencyLevel up or down?

Additionally:

  • What are the hallmarks of a good hashCode implementation? (If an SO question addresses this, just link to it.)

Thank you!

A: 

loadFactor: controls when the implementation decides to resize the hashtable. Too low a value will waste space; too high a value will make collisions more likely and lookups slower.

concurrencyLevel: tells the implementation to try to optimize for the given number of writing threads. According to the API docs, being off by up to a factor of 10 shouldn't have much effect on performance.

The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing. The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact.
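For illustration, here is a minimal sketch of a construction that sets all three parameters explicitly (the sizes below are made-up assumptions, not recommendations):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CacheSetup {
        public static void main(String[] args) {
            int expectedEntries = 4096; // assumption: rough upper bound on mappings
            int writerThreads = 8;      // assumption: size of the updating thread pool

            // Arguments: initialCapacity, loadFactor, concurrencyLevel
            ConcurrentMap<String, Long> cache =
                    new ConcurrentHashMap<String, Long>(expectedEntries, 0.75f, writerThreads);

            cache.put("hits", 0L);
            System.out.println(cache);
        }
    }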

A good hashCode implementation will distribute the hash values uniformly across the range of possible values. If the set of keys is known in advance, it is possible to define a "perfect" hash function that creates a unique hash value for each key.
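As a toy illustration of the "perfect hash" idea: when the key set is fixed in advance, an enum's ordinal() already gives each key a unique, collision-free slot (a sketch; the Command type is hypothetical):

    // Sketch: with a fixed, known key set, each key gets a unique slot.
    enum Command { GET, PUT, DELETE, LIST }

    public class PerfectHashDemo {
        // Exactly one slot per key, so collisions are impossible.
        private static final String[] DESCRIPTIONS = new String[Command.values().length];

        public static void main(String[] args) {
            DESCRIPTIONS[Command.GET.ordinal()] = "read a value";
            DESCRIPTIONS[Command.PUT.ordinal()] = "write a value";
            DESCRIPTIONS[Command.DELETE.ordinal()] = "remove a value";
            DESCRIPTIONS[Command.LIST.ordinal()] = "enumerate keys";

            // ordinal() acts as the perfect hash: a unique value per key.
            System.out.println(DESCRIPTIONS[Command.PUT.ordinal()]);
        }
    }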

Jim Garrison
A: 

loadFactor is set to 0.75 by default, what criteria should be used to adjust this up or down?

You need some background on how hash maps work before you can understand how this works. The map is essentially a series of buckets, and each value in the map gets put into a bucket depending on its hash code. A loadFactor of 0.75 means that when the buckets are more than 75% full, the Map is resized.
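Concretely, the resize threshold is just capacity times load factor; a small sketch of the arithmetic:

    public class LoadFactorDemo {
        public static void main(String[] args) {
            int capacity = 16;        // default number of buckets
            float loadFactor = 0.75f; // default load factor

            // The map resizes (roughly doubling its bucket count) once the
            // number of entries exceeds this threshold.
            int threshold = (int) (capacity * loadFactor);
            System.out.println("Resize after " + threshold + " entries"); // prints 12
        }
    }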

concurrencyLevel is set to 16 by default, how do we establish the number of concurrently updating threads? What criteria should be used to adjust this up or down?

This is asking how many threads you expect to modify the Map concurrently (simultaneously).

For hash codes, see Joshua Bloch's Effective Java.
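The recipe from that book boils down to folding each field used in equals() into a running result with a small odd prime; a typical sketch (the Point class is hypothetical):

    public final class Point {
        private final int x;
        private final int y;
        private final String label;

        public Point(int x, int y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y
                    && (label == null ? p.label == null : label.equals(p.label));
        }

        @Override
        public int hashCode() {
            // Effective Java recipe: combine every field used in equals(),
            // multiplying by an odd prime (31) at each step.
            int result = 17;
            result = 31 * result + x;
            result = 31 * result + y;
            result = 31 * result + (label == null ? 0 : label.hashCode());
            return result;
        }
    }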

Kevin
A: 

Load factor is primarily related to the quality of the hash function. The closer the load factor is to zero, the less likely collisions are, even if the hash function isn't so great. The trade-off is that the memory footprint is larger. In other words, the HashMap isn't giving each distinct hashcode its own bucket; it groups hash codes together, with several mapping to the same bucket, so the more buckets it has, the more spread out the distribution and the less likely collisions become.

So the bottom line is that you fiddle with the load factor to improve lookup time or reduce memory, according to your needs and the objects you are storing in the Map.

ConcurrencyLevel really depends on your application. If you only have two or three threads running in the application, there you go. If you are an application server with an arbitrary number of threads, then you need to understand what your load capacity is and what point you want to optimize for.

A good-quality hashCode implementation provides as wide a distribution across the range of potential values as possible, with as few collisions as possible, while honoring the contract. In other words, it allows the HashMap (or Set, as the case may be) to distribute the objects into separate buckets, making lookups faster.

Yishai
+1  A: 

The short answer: set "initial capacity" to roughly how many mappings you expect to put in the map, and leave the other parameters at their default.
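In code, assuming (hypothetically) that you expect around 10000 mappings, that advice is just:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ShortAnswer {
        public static void main(String[] args) {
            // Only initialCapacity is given; loadFactor and concurrencyLevel
            // keep their defaults (0.75 and 16).
            Map<String, String> map = new ConcurrentHashMap<String, String>(10000);
            map.put("key", "value");
            System.out.println(map.size());
        }
    }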

Long answer:

  • load factor is the ratio between the number of expected elements and the number of "buckets" in the map;

  • 0.75 is usually a reasonable compromise: as I recall, it means that with a good hash function, on average we expect about 1.6 redirects to find an element in the map (or around that figure);

    • changing the load factor changes the compromise between more redirects to find an element and less wasted space, but 0.75 really is usually a good value;

    • in principle, set concurrencyLevel to the number of concurrent threads you expect to have modifying the map, although overestimating this doesn't appear to have a bad effect other than wasting memory (I wrote a little on ConcurrentHashMap performance a while ago, in case you're interested).

Informally, your hash function should essentially aim to have as much "randomness" in the bits as possible. Or, more strictly, the hash code for a given element should give each bit a roughly 50% chance of being set. It's actually easier to illustrate this with an example: again, you may be interested in some stuff I wrote about how the String hash function works and associated hash function guidelines. Feedback is obviously welcome on any of this stuff.

One thing I also mention at some point is that you don't have to be too paranoid in practice: if your hash function produces a "reasonable" amount of randomness in some of the bits, then it will often be OK. In the worst case, sticking representative pieces of data into a string and taking the hash code of the string actually doesn't work so badly.
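As a worked version of that fallback (a sketch only; the Booking class is hypothetical, and the extra String allocation per call makes this unsuitable for hot paths):

    public final class Booking {
        private final String customer;
        private final int roomNumber;

        public Booking(String customer, int roomNumber) {
            this.customer = customer;
            this.roomNumber = roomNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Booking)) return false;
            Booking b = (Booking) o;
            return roomNumber == b.roomNumber && customer.equals(b.customer);
        }

        @Override
        public int hashCode() {
            // Worst-case fallback: stick the significant fields into a String
            // and reuse String's reasonably well-distributed hash function.
            return (customer + "|" + roomNumber).hashCode();
        }
    }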

Neil Coffey