views: 3808 · answers: 24

I want to create a large HashMap but the put() performance is not good enough. Any ideas?

Other data structure suggestions are welcome but I need the lookup feature of a Java Map:

map.get(key)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?

My hashcode method:

byte[] a = new byte[2];
byte[] b = new byte[3];
...

public int hashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}

I am using the commutative property of addition to ensure that equal objects have the same hashcode. The arrays are bytes with values in the range 0 - 51. Values are only used once in either array. The objects are equal if the a arrays contain the same values (in either order) and the same goes for the b arrays. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.
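
For context, a minimal sketch of a key class with these semantics follows; the class and field names are hypothetical, only the hashCode() body is taken from above, and equals() compares sorted copies so that element order does not matter.

import java.util.Arrays;

// Hypothetical key class: two small byte arrays with values 0-51,
// equal when each array holds the same values in any order.
public class PairKey {
    private final byte[] a; // length 2
    private final byte[] b; // length 3

    public PairKey(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PairKey)) return false;
        PairKey other = (PairKey) o;
        // Compare sorted copies so that {0,1} equals {1,0}, etc.
        return Arrays.equals(sorted(a), sorted(other.a))
            && Arrays.equals(sorted(b), sorted(other.b));
    }

    private static byte[] sorted(byte[] src) {
        byte[] copy = src.clone();
        Arrays.sort(copy);
        return copy;
    }

    @Override // hashCode body from the question; order-insensitive because it only sums
    public int hashCode() {
        int hash = 503;
        hash = hash * 5381 + (a[0] + a[1]);
        hash = hash * 5381 + (b[0] + b[1] + b[2]);
        return hash;
    }
}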

EDIT, some notes:

  • A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

  • Setting the initial capacity of the default Java HashMap to 26 million decreases the performance.

  • Some people have suggested using databases; in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question; a full database would be overkill and much slower than a good data structure solution (after all, the database is just software and would add communication and possibly disk overhead).

+8  A: 

26 million entries? Sounds like the data actually belongs in a database. I would recommend using a decent RDBMS with a proper data model and just using JDBC/SQL to obtain the specific entries from it.

BalusC
Not an answer to the actual question, but probably the best recommendation. I'm actually impressed just thinking about a single hashmap with so many values.
Gnoupi
That is not an option in my case because I want the performance of an in-memory HashMap, and the memory used is not an issue.
Nash0
@Gnoupi: "Other data structure suggestions are welcome" - clearly this is an answer to the OP.
280Z28
@Nash0: Obviously you *don't* want the performance of an in-memory HashMap or this post wouldn't exist.
280Z28
You are right, I want "fast in memory lookup" performance.
Nash0
@Nash0: You could use an in-memory HSQLDB or SQLite database, or use any other database but configure it to use a large amount of memory. For a drop-in replacement, you could use a Map implementation which is backed by the database.
rob
It's simply a pain, that many entries in a `Map`. It also suggests that you aren't using a numbered sequence as the key (as in decent data models and in a `List`), and that is really a performance pain.
BalusC
Apart from embedded RDBMS suggestions such as HSQLDB/SQLite, you could also take advantage of the JDK's built-in JavaDB.
BalusC
This doesn't answer the question.
OscarRyz
To paraphrase oxbow_lakes: "[your favorite] DB is nowhere near fast enough for this number of entries due to the serialization/IO overhead; it could never be faster than a hashmap and the OP doesn't care about persistence. Your suggestion is not a good one."
ckeh
@ckeh: only if you have enough RAM, which is in practice unlikely with 26 million entries.
BalusC
"only if you have enough RAM, which is in practice unlikely with 26 million entries." - Perhaps... but that is speculation. Note that the OP didn't encounter an issue with RAM in practice, as per the answer: http://stackoverflow.com/questions/1757363/java-hashmap-performance-optimization-alternative/1761232#1761232
ckeh
It would be quite interesting if the OP posted information about RAM usage, etc, since a number of answers assumed that is too much data for an in memory solution while the OP's results indicate that was not the case.
ckeh
+3  A: 

HashMap has an initial capacity, and HashMap's performance depends very heavily on the hashCode produced by the underlying objects.

Try to tweak both.

Mykola Golubyev
+1  A: 

Are you specifying the proper capacity and load factor in the map constructor? Did you implement the hashCode method of the objects you are inserting?

GregB
Yes, I implemented the `hashCode()` method and it satisfies the requirement that equal objects have the same hashcode.
Nash0
That's a necessary but not sufficient condition for a good hashCode. You also want different inputs to produce different outputs as often as possible.
Jay
+5  A: 

My first idea is to make sure you're initializing your HashMap appropriately. From the JavaDocs for HashMap:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

So if you're starting off with a too-small HashMap, then every time it needs to resize, all the hashes are recomputed... which might be what you're feeling when you get to the 2-3 million insertion point.
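
As a rough illustration of that sizing, here is a minimal sketch, assuming the 26 million figure from the question and the default load factor of 0.75 (the helper name is hypothetical):

import java.util.HashMap;
import java.util.Map;

public final class MapSizing {
    // Pre-size the table so it never rehashes while inserting expectedEntries
    // entries: the capacity must be at least expectedEntries / loadFactor.
    static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
        return new HashMap<>(initialCapacity, loadFactor);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = presized(26_000_000, 0.75f);
        System.out.println("map created; no rehashing needed while filling it");
    }
}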

delfuego
I don't think they are recomputed, ever. The table size is increased, the hashes are kept.
Henning
HashMap just does a bit-wise AND for every entry: newIndex = storedHash & (newCapacity - 1).
Henning
Henning: Perhaps poor wording on delfuego's part, but the point is valid. Yes, the hash values are not recomputed in the sense that the output of hashCode() is not recalculated. But when the table size is increased, all the keys must be re-inserted into the table, that is, the hash value must be re-hashed to get a new slot number in the table.
Jay
Jay, yep -- poor wording indeed, and what you said. :)
delfuego
Jay: But that's a relatively cheap operation, and as the size doubles every time, this can under no circumstances explain poor performance.
Henning
The strange thing is that setting the initial capacity equal to the number of elements I want to insert actually *decreases* the performance.
Nash0
@Henning: Well, the AND function itself is cheap, but you do have to loop through all the slots in the table, then chase the linked list on every slot, and then for each entry do the AND (plus a "length-1" -- seems to me they should have cached the subtraction, but whatever). None of that is expensive of itself, but doing it for millions of entries could add up. I haven't done any benchmarks, but I'd think it's worth looking at.
Jay
@Jay: I'm not against properly initializing the capacity, quite to the contrary. But if you start with the default capacity (16), you will only have to increase about 20 times for 26 million entries. That's a few million AND and copy operations total, sure, but could that explain a _significant_ drop in performance to the point where it's noticeable, let alone unbearable? The bad hashCode() scenario you describe in your answer sounds much better.
Henning
@delfuego and @nash0: Yeap, setting the initial capacity equal to the number of elements decreases the performance because you're having tens of millions of collisions and thus you're only using a small amount of that capacity. *Even* if you use all the available entries, setting the same capacity will make it worse, because due to the load factor more space will be requested. You'll have to use `initialCapacity = maxEntries/loadFactor` ( such as 30M, 0.95 for 26M entries ) but this is **NOT** your case, since you're having all those collisions you're using only about 20k entries or less.
OscarRyz
Good point Oscar, it is the hash method that needs to be fixed.
Nash0
A: 

Allocate a large map in the beginning. If you know it will have 26 million entries and you have the memory for it, do a `new HashMap(30000000)`.

Are you sure you have enough memory for 26 million entries with 26 million keys and values? This sounds like a lot of memory to me. Are you sure that the garbage collection is still doing fine at your 2 to 3 million mark? I could imagine that as a bottleneck.

ReneS
Oh, another thing. Your hash codes have to be evenly distributed to avoid large linked lists at single positions in the map.
ReneS
A: 
OscarRyz
Oscar, as pointed out elsewhere above (in response to your comments), you seem to be assuming that more collisions is GOOD; it's very much NOT good. A collision means that the slot at a given hash goes from containing a single entry to containing a list of entries, and this list has to be searched/traversed every time the slot is accessed.
delfuego
@delfuego: Not really, that happens only when you have a collision using different classes but for the same class the same entry is used ;)
OscarRyz
@delfuego: See for yourself: http://pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries.
OscarRyz
Actually I am adding 26 million items; the number of items is (52 choose 5) * (5 choose 3) =~ 26 million. I am sorry, but my loop is not doing much besides calling `put()`; it is the HashMap performance.
Nash0
I think I see what you mean by "not storing 26 million objects": I am storing 26 million, but they are only going into 20k hash buckets. From my experimentation that is true.
Nash0
@Nash0: MMhh I see, yeah you should know your code much better. I just can tell about my findings. I think you should definitely use a profiler and let it tell you where the problem is. I still think it is not the hash map, but of course I don't have your code. Here's the code for my tests if you find it useful. http://pastebin.com/f7a6cd5a5 Using your hashcode for 26M with no initial capacity runs in 8s
OscarRyz
That's the number I got also, 20K buckets; they are not really being stored... wait a minute, are you using something other than RAM? Because if you're serializing somehow, that's where the performance is going.
OscarRyz
@Oscar - see my response to you with MAK's answer. HashMap maintains a linked list of entries at each hash bucket, and walks that list calling equals() on every element. The object's class has nothing to do with it (other than a short-circuit on equals()).
kdgregory
@Oscar - Reading your answer it seems that you are assuming that equals() will return true if the hashcodes are the same. This is not part of the equals/hashcode contract. If I've misunderstood, ignore this comment.
kdgregory
Thank you very much for the effort Oscar, but I think you are confusing the key objects being equal vs having the same hash code. Also, in one of your code links you are using equal strings as the key; remember that strings in Java are immutable. I think we both learned a lot about hashing today :)
Nash0
@Nash0: OOoook, yeap, we learned a lot :) Now I got it, so you're returning the same hash for objects that are NOT equals, right? Yeap I miss that point. The javadoc says: *It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.*http://java.sun.com/javase/6/docs/api/java/lang/Object.html#hashCode()
OscarRyz
+6  A: 

To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is found, and then mapped to an entry in the table. If two hash values map to the same entry -- a "hash collision" -- HashMap builds a linked list.

Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.

So if you're seeing performance problems, the first thing I'd check is: Am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". Like, suppose you were working with strings, and you took the length of the string for the hash value. (Not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, this would be a very bad hash function.

Edit: I should add: Every time you add a new entry, HashMap checks if this is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, etc. By the time you get to key #1 million, you've done nearly half a trillion compares.

@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, that this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: Whenever a key hashes to the same slot, HashMap must check if it's a duplicate of an existing key, or if they are just in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap: I would say that the "whole point" is that you can retrieve elements by key quickly.

But anyway, that doesn't affect the "whole point" that I was trying to make: When you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check each new key to see if it is in fact a duplicate of an existing key, each attempt to add a new entry that maps to this same slot must chase the linked list examining each existing entry to see if this is a duplicate of a previously-seen key, or if it is a new key.
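
A quick way to check that distribution is to count how many distinct hash codes a sample of keys actually produces; the sketch below is illustrative only and nothing in it comes from the answer above.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public final class HashSpread {
    // Count the distinct hashCode() values a set of keys produces and report
    // the worst pile-up; a tiny distinct count relative to the number of keys
    // means the hash function, not the HashMap, is the bottleneck.
    static void report(Collection<?> keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Object key : keys) {
            counts.merge(key.hashCode(), 1, Integer::sum);
        }
        int worst = counts.values().stream().max(Integer::compare).orElse(0);
        System.out.println(keys.size() + " keys -> " + counts.size()
                + " distinct hash codes, worst slot holds " + worst + " keys");
    }
}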

Jay
Yup, a bad hash function would result in this kind of behavior. +1
Henning
Not really. The list is created **only** if the hash is the same, but the key is **different**. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because `String.equals( Integer )` is `false`. **But** if you have the same class ( or at least `.equals` returns true ) then the same entry is used. For instance `new String("one")` and `new String("one") used as keys, will use the same entry. Actually this is the **WHOLE** point of HashMap in first place! See for yourself: http://pastebin.com/f20af40b9
OscarRyz
@Oscar: See my reply appended to my original post.
Jay
+2  A: 

You can try to use an in-memory database like HSQLDB.

Adrian
+1  A: 

Have you considered using an embedded database to do this? Look at Berkeley DB. It is open source and is now owned by Oracle.

It stores everything as key->value pairs; it is NOT an RDBMS, and it aims to be fast.

coolest_head
Berkeley DB is nowhere near fast enough for this number of entries due to the serialization/IO overhead; it could never be faster than a hashmap and the OP doesn't care about persistence. Your suggestion is not a good one.
oxbow_lakes
+1  A: 

SQLite lets you use it in memory.

JRL
+1  A: 

First you should check that you are using the Map correctly: a good hashCode() method for the keys, a suitable initial capacity for the Map, the right Map implementation, and so on, as many other answers describe.

Then I would suggest using a profiler to see what is actually happening and where the execution time is spent. Is, for example, the hashCode() method executed billions of times?

If that doesn't help, how about using something like EHCache or memcached? Yes, they are products for caching but you could configure them so that they will have enough capacity and will never evict any values from cache storage.

Another option would be some database engine that is lighter weight than full SQL RDBMS. Something like Berkeley DB, maybe.

Note that I personally have no experience with these products' performance, but they could be worth a try.

Juha Syrjälä
+3  A: 

If the keys have any pattern to them then you can split the map into smaller maps and have an index map.

Example: keys 1, 2, 3, ..., n split across 28 maps of 1 million each. Index map: 1-1,000,000 -> Map1; 1,000,000-2,000,000 -> Map2; and so on.

So you'll be doing two lookups, but the key set would be 1,000,000 vs 28,000,000. You can easily do this with string patterns also.

If the keys are completely random then this will not work.
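
A minimal sketch of this range-based index-map idea follows; the partition count, range size, and class name are illustrative assumptions rather than anything prescribed by the answer (for random keys, a comment below suggests routing by key.hashCode() % 28 instead).

import java.util.HashMap;
import java.util.Map;

public final class PartitionedMap<V> {
    // 28 sub-maps; a numeric key is routed to a partition by its range.
    // Both the partition count and the range size are example values.
    private static final int PARTITIONS = 28;
    private static final long RANGE = 1_000_000L;

    @SuppressWarnings("unchecked")
    private final Map<Long, V>[] maps = new HashMap[PARTITIONS];

    public PartitionedMap() {
        for (int i = 0; i < PARTITIONS; i++) {
            maps[i] = new HashMap<>();
        }
    }

    // Assumes non-negative keys roughly in the range 0 .. 28 million.
    private Map<Long, V> partitionFor(long key) {
        int index = (int) Math.min(key / RANGE, PARTITIONS - 1);
        return maps[index];
    }

    public void put(long key, V value) {
        partitionFor(key).put(key, value);
    }

    public V get(long key) {
        return partitionFor(key).get(key);
    }
}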

coolest_head
Even if the keys are random, you could use (key.hashCode() % 28) to select the map in which to store that key-value pair.
Juha Syrjälä
Interesting, it could work.
Nash0
A: 

You could try caching the computed hash code in the key object.

Something like this:

private Integer hashCode; // cached value; null until computed the first time

public int hashCode() {
  if(this.hashCode == null) {
     this.hashCode = computeHashCode();
  }
  return this.hashCode;
}

private int computeHashCode() {
   int hash = 503;
   hash = hash * 5381 + (a[0] + a[1]);
   hash = hash * 5381 + (b[0] + b[1] + b[2]);
   return hash;
}

Of course you have to be careful not to change contents of the key after the hashCode has been calculated for the first time.

Edit: It seems that caching hash code values is not worthwhile when you are adding each key only once to a map. In some other situations this could be useful.

Juha Syrjälä
As is pointed out below, there's no recomputation of the hashcodes of objects in a HashMap when it's resized, so this doesn't gain you anything.
delfuego
And, the performance gain is negligible
OscarRyz
+5  A: 

I'd suggest a three-pronged approach:

  1. Run Java with more memory: java -Xmx256M for example to run with 256 Megabytes. Use more if needed and you have lots of RAM.

  2. Cache your calculated hash values as suggested by another poster, so each object only calculates its hash value once.

  3. Use a better hashing algorithm. The one you posted would return the same hash where a = {0, 1} as it would where a = {1, 0}, all else being equal.

Utilise what Java gives you for free.

public int hashCode() {
    return 31 * Arrays.hashCode(a) + Arrays.hashCode(b);
}

I'm pretty sure this has much less chance of clashing than your existing hashCode method, although it depends on the exact nature of your data.

Steve McLeod
RAM might be way too small for this kind of map and these arrays, so I already suspected a memory limitation problem.
ReneS
+5  A: 

If the arrays in your posted hashCode are bytes, then you will likely end up with lots of duplicates.

a[0] + a[1] will always be between 0 and 512. Adding the b's will always result in a number between 0 and 768. Multiply those and you get an upper limit of 400,000 unique combinations, assuming your data is perfectly distributed among every possible value of each byte. If your data is at all regular, you likely have far fewer unique outputs from this method.

Peter Recore
+6  A: 

One thing I notice in your hashCode method is that the order of the elements in the arrays a[] and b[] doesn't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). Actually all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)==sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:

hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c3*b[2]);

where c0, c1 and c3 are distinct constants (you can use different constants for b if necessary). That should even things out a bit more.
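
A fuller version of this suggestion might look like the sketch below; the class name, method name, and constant values are arbitrary examples, not part of the answer.

public final class WeightedHash {
    // Distinct multipliers so that the position of each element affects the
    // hash; the specific constant values here are arbitrary examples.
    private static final int C0 = 31;
    private static final int C1 = 131;
    private static final int C2 = 1313;

    static int hash(byte[] a, byte[] b) {
        int hash = 503;
        hash = hash * 5381 + (C0 * a[0] + C1 * a[1]);
        hash = hash * 5381 + (C0 * b[0] + C1 * b[1] + C2 * b[2]);
        return hash;
    }
}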

MAK
That is a good suggestion.
Nash0
Although I should also add that it won't work for me because I want the property that arrays with the same elements in different orders give the same hashcode.
Nash0
In that case, you have 52C2 + 52C3 hashcodes (23426 according to my calculator), and a hashmap is very much the wrong tool for the job.
kdgregory
Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading.
OscarRyz
@Oscar - more collisions equals *more* work to do, because now you have to do a linear search of the hash chain. If you have 26,000,000 distinct values per equals(), and 26,000 distinct values per hashCode(), then the bucket chains will have 1,000 objects each.
kdgregory
@Nash0: You appear to be saying that you want these to have the same hashCode but at the same time not being equal (as defined by the equals() method). Why would you want that?
MAK
@kdgregory: Yes, but only if the collision happens with **different** classes, for the same class ( which is the case ) the same entry is used.
OscarRyz
@Oscar - I'm not sure I follow you: how does class matter? A HashMap works by picking a bucket chain by dividing the hashcode value by the number of buckets in the map (the "table" variable in Sun's implementation). Each bucket holds a linked list of entries, and the map walks that list calling equals() on every node.
kdgregory
@Nash0: To follow up on my last comment, if you cannot change the hashCode method, I think using a TreeMap would give you better performance.
MAK
@kdgregory: I get 512*768 = 391680 possible hashes. Am I smoking dope?
Trevor Harrison
+1  A: 

Another poster already pointed out that your hashcode implementation will result in a lot of collisions due to the way that you're adding values together. I'm willing to bet that, if you look at the HashMap object in a debugger, you'll find that you have maybe 200 distinct hash values, with extremely long bucket chains.

If you always have values in the range 0..51, each of those values will take 6 bits to represent. If you always have 5 values, you can create a 30-bit hashcode with left-shifts and additions:

    int code = a[0];
    code = (code << 6) + a[1];
    code = (code << 6) + b[0];
    code = (code << 6) + b[1];
    code = (code << 6) + b[2];
    return code;

The left-shift is fast, but will leave you with hashcodes that aren't evenly distributed (because 6 bits implies a range 0..63). An alternative is to multiply the hash by 51 and add each value. This still won't be perfectly distributed (eg, {2,0} and {1,51} will collide), and will be slower than the shift.

    int code = a[0];
    code = code * 51 + a[1];
    code = code * 51 + b[0];
    code = code * 51 + b[1];
    code = code * 51 + b[2];
    return code;
kdgregory
@kdgregory: I've answered about the "more collisions implies more work" somewhere else :)
OscarRyz
+4  A: 

If the two byte arrays you mention are your entire key, the values are in the range 0-51 and unique, and the order within the a and b arrays is insignificant, my math tells me that there are only about 26 million possible combinations and that you are likely trying to fill the map with values for all possible keys.

In this case, both filling and retrieving values from your data store would of course be much faster if you use an array instead of a HashMap and index it from 0 to 25989599.

jarnbjo
That is a very good idea, and in fact I am doing that for another data storage issue with 1.2 billion elements. In this case I wanted to take the easy way out and use a premade data structure :)
Nash0
+10  A: 

As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed to get a unique hash code for every object:

public int hashCode() {       
    Arrays.sort(a);
    Arrays.sort(b);       
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}

The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second, measured over blocks of 100,000 puts from 100,000 to 2,000,000, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much much better. The old method tailed off very quickly while the new one keeps up a good throughput.

Nash0
I suggest not modifying the arrays in the `hashCode` method. By convention, `hashCode` does not change the object's state. Perhaps the constructor would be a better place to sort them.
Michael Myers
I agree that the sorting of the arrays should happen in the constructor. The code shown never seems to set the hashCode. Calculating the code can be done more simply as follows: `int result = a[0]; result = result * 52 + a[1]; //etc`.
rsp
I agree that sorting in the constructor and then calculating the hash code as mmyers and rsp suggest is better. In my case my solution is acceptable and I wanted to highlight the fact that the arrays must be sorted for `hashCode()` to work.
Nash0
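
Putting the two suggestions above together, here is a hedged sketch of what the key might look like with sorting moved to the constructor and the base-52 code computed iteratively; the class and field names are hypothetical, and equals() is omitted for brevity.

import java.util.Arrays;

public final class SortedKey {
    private final byte[] a; // length 2, values 0-51
    private final byte[] b; // length 3, values 0-51
    private final int hash; // computed once, after sorting

    public SortedKey(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a); // sort in the constructor, not in hashCode()
        Arrays.sort(this.b);
        int h = this.a[0];
        h = h * 52 + this.a[1]; // treat the five sorted values as base-52 digits
        h = h * 52 + this.b[0];
        h = h * 52 + this.b[1];
        h = h * 52 + this.b[2];
        this.hash = h;
    }

    @Override
    public int hashCode() {
        return hash; // equals(), comparing the sorted arrays, is omitted here
    }
}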
+3  A: 

Getting into the gray area of "on/off topic", but necessary to eliminate confusion regarding Oscar Reyes' suggestion that more hash collisions are a good thing because they reduce the number of elements in the HashMap. I may misunderstand what Oscar is saying, but I don't seem to be the only one: kdgregory, delfuego, Nash0, and I all seem to share the same (mis)understanding.

If I understand what Oscar is saying about the same class with the same hashcode, he's proposing that only one instance of a class with a given hashcode will be inserted into the HashMap. Eg, if I have an instance of SomeClass with a hashcode of 1 and a second instance of SomeClass with a hashcode of 1, only one instance of SomeClass is inserted.

The Java pastebin example at http://pastebin.com/f20af40b9 seems to indicate the above correctly summarizes what Oscar is proposing.

Regardless of any understanding or misunderstanding, what actually happens is that different instances of the same class are not collapsed into a single HashMap entry just because they have the same hashcode; that only happens once the keys are also determined to be equal. The hashcode contract requires that equal objects have the same hashcode; however, it doesn't require that unequal objects have different hashcodes (although this may be desirable for other reasons)[1].

The pastebin.com/f20af40b9 example (which Oscar refers to at least twice) follows, but modified slightly to use JUnit assertions rather than println statements. This example is used to support the proposal that identical hashcodes cause collisions and that when the classes are the same only one entry is created (eg, only one String in this specific case):

@Test
public void shouldOverwriteWhenEqualAndHashcodeSame() {
    String s = new String("ese");
    String ese = new String("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // AND equal
    assertTrue(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(2, map.size());

    assertEquals(2, map.get("ese"));
    assertEquals(3, map.get(some));

    assertTrue(s.equals(ese) && s.equals("ese"));
}

class SomeClass {
    public int hashCode() {
        return 100727;
    }
}

However, the hashcode isn't the complete story. What the pastebin example neglects is the fact that both s and ese are equal: they are both the string "ese". Thus, inserting or getting the contents of the map using s or ese or "ese" as the key are all equivalent because s.equals(ese) && s.equals("ese").

A second test demonstrates it is erroneous to conclude that identical hashcodes on the same class is the reason the key -> value s -> 1 is overwritten by ese -> 2 when map.put(ese, 2) is called in test one. In test two, s and ese still have the same hashcode (as verified by assertEquals(s.hashCode(), ese.hashCode());) AND they are the same class. However, s and ese are MyString instances in this test, not Java String instances, with the only difference relevant for this test being the equals: String s equals String ese in test one above, whereas MyString s does not equal MyString ese in test two:

@Test
public void shouldInsertWhenNotEqualAndHashcodeSame() {
    MyString s = new MyString("ese");
    MyString ese = new MyString("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // BUT not equal
    assertFalse(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(3, map.size());

    assertEquals(1, map.get(s));
    assertEquals(2, map.get(ese));
    assertEquals(3, map.get(some));
}

/**
 * NOTE: equals is not overridden so the default implementation is used
 * which means objects are only equal if they're the same instance, whereas
 * the actual Java String class compares the value of its contents.
 */
class MyString {
    String i;

    MyString(String i) {
        this.i = i;
    }

    @Override
    public int hashCode() {
        return 100727;
    }
}

Based on a later comment, Oscar seems to reverse what he's said earlier and acknowledges the importance of equals. However, it still seems the notion that equals is what matters, not the "same class", is unclear (emphasis mine):

"Not really. The list is created only if the hash is the same, but the key is different. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because String.equals( Integer ) is false. But if you have the same class ( or at least .equals returns true ) then the same entry is used. For instance new String("one") and `new String("one") used as keys, will use the same entry. Actually this is the WHOLE point of HashMap in first place! See for yourself: pastebin.com/f20af40b9 – Oscar Reyes"

versus earlier comments that explicitly address the importance of identical class and same hashcode, with no mention of equals:

"@delfuego: See for yourself: pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries. – Oscar Reyes"

or

"Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading. – Oscar Reyes"

or

"@kdgregory: Yes, but only if the collision happens with different classes, for the same class ( which is the case ) the same entry is used. – Oscar Reyes"

Again, I may misunderstand what Oscar was actually trying to say. However, his original comments have caused enough confusion that it seems prudent to clear everything up with some explicit tests so there are no lingering doubts.


[1] - From Effective Java, Second Edition by Joshua Bloch:

  • Whenever it is invoked on the same object more than once during an execution of an application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

ckeh
+2  A: 

I'm late here, but a couple comments about big maps:

  1. As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
  2. However, a potentially hidden issue here is GC impact of giant maps.

I'm making an assumption that these maps are long-lived, i.e. you populate them and they stick around for the duration of the app. I'm also assuming the app itself is long-lived -- like a server of some sort.

Each entry in a Java HashMap requires three objects: the key, the value, and the Entry that ties them together. So 26M entries in the map means 26M * 3 == 78M objects. This is fine until you hit a full GC. Then you've got a pause-the-world problem. The GC will look at each of the 78M objects and determine they're all alive. 78M+ objects is just a lot of objects to look at. If your app can tolerate occasional long (perhaps many seconds) pauses, there's no issue. If you're trying to achieve any latency guarantees you could have a major issue (of course if you want latency guarantees, Java's not the platform to choose :)) If the values in your maps churn quickly you can end up with frequent full collects which compounds the problem greatly.

I don't know of a great solution to this issue. Ideas:

  • It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
  • If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collects
  • You could create your own map impl and do explicit memory management on byte[] (i.e. trade cpu for more predictable latency by serializing millions of objects into a single byte[] -- ugh!)
  • Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
  • Hope that the new G1 collector will help (mainly applies to the high-churn case)

Just some thoughts from someone who has spent a lot of time with giant maps in Java.


overthink
+1  A: 

As pointed out, your hashcode implementation has too many collisions, and fixing it should result in decent performance. Moreover, caching hashCodes and implementing equals efficiently will help.

If you need to optimize even further:

By your description, there are only (52 * 51 / 2) * (52 * 51 * 50 / 6) = 29304600 different keys (of which 26000000, i.e. about 90%, will be present). Therefore, you can design a hash function without any collisions, and use a simple array rather than a hashmap to hold your data, reducing memory consumption and increasing lookup speed:

T[] array = (T[]) new Object[Key.maxHashCode]; // generic array creation needs a cast

void put(Key k, T value) {
    array[k.hashCode()] = value;
}

T get(Key k) {
    return array[k.hashCode()];
}

(Generally, it is impossible to design an efficient, collision-free hash function that clusters well, which is why a HashMap will tolerate collisions, which incurs some overhead)

Assuming a and b are sorted, you might use the following hash function:

public int hashCode() {
    assert a[0] < a[1]; 
    int ahash = a[1] * a[1] / 2 
              + a[0];

    assert b[0] < b[1] && b[1] < b[2];

    int bhash = b[2] * b[2] * b[2] / 6
              + b[1] * b[1] / 2
              + b[0];
    return bhash * 52 * 52 / 2 + ahash;
}

static final int maxHashCode = 52 * 52 / 2 * 52 * 52 * 52 / 6;

I think this is collision-free. Proving this is left as an exercise for the mathematically inclined reader.

meriton
A: 

Maybe try using Commons Collections' FastHashMap if you need it to be synchronized:

http://commons.apache.org/collections/api/org/apache/commons/collections/FastHashMap.html

01
A: 

I did a small test a while back comparing a list with a hashmap; the funny thing was that iterating through the list and finding the object took the same amount of time in milliseconds as using the hashmap's get function... just an FYI. Oh yeah, memory is a big issue when working with hashmaps that size.

Grep