views: 3808 · answers: 24

I want to create a large HashMap but the put() performance is not good enough. Any ideas?

Other data structure suggestions are welcome but I need the lookup feature of a Java Map:

map.get(key)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?

My hashcode method:

byte[] a = new byte[2];
byte[] b = new byte[3];
...

public int hashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}

I am using the commutative property of addition to ensure that equal objects have the same hashcode. The arrays are bytes with values in the range 0 - 51. Values are only used once in either array. The objects are equal if the a arrays contain the same values (in either order) and the same goes for the b arrays. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.
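
For context, a minimal sketch of a key class with these semantics follows; the class and field names are hypothetical, only the hashCode() body is taken from above, and equals() compares sorted copies so that element order does not matter.

import java.util.Arrays;

// Hypothetical key class: two small byte arrays with values 0-51,
// equal when each array holds the same values in any order.
public class PairKey {
    private final byte[] a; // length 2
    private final byte[] b; // length 3

    public PairKey(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PairKey)) return false;
        PairKey other = (PairKey) o;
        // Compare sorted copies so that {0,1} equals {1,0}, etc.
        return Arrays.equals(sorted(a), sorted(other.a))
            && Arrays.equals(sorted(b), sorted(other.b));
    }

    private static byte[] sorted(byte[] src) {
        byte[] copy = src.clone();
        Arrays.sort(copy);
        return copy;
    }

    @Override // hashCode body from the question; order-insensitive because it only sums
    public int hashCode() {
        int hash = 503;
        hash = hash * 5381 + (a[0] + a[1]);
        hash = hash * 5381 + (b[0] + b[1] + b[2]);
        return hash;
    }
}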

EDIT, some notes:

  • A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

  • Setting the initial capacity of the default Java HashMap to 26 million decreases the performance.

  • Some people have suggested using databases; in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question; a full database would be overkill and much slower than a good data structure solution (after all, the database is just software and would add communication and possibly disk overhead).

+8  A: 

26 million entries? Sounds like the data actually belongs in a database. I would recommend using a decent RDBMS with a proper data model and just using JDBC/SQL to obtain the specific entries from it.

BalusC
Not an answer to the actual question, but probably the best recommendation. I'm actually impressed just thinking about a single hashmap with so many values.
Gnoupi
That is not an option in my case because I want the performance of an in-memory HashMap, and the memory used is not an issue.
Nash0
@Gnoupi: "Other data structure suggestions are welcome" - clearly this is an answer to the OP.
280Z28
@Nash0: Obviously you *don't* want the performance of an in-memory HashMap or this post wouldn't exist.
280Z28
You are right, I want "fast in memory lookup" performance.
Nash0
@Nash0: You could use an in-memory HSQLDB or SQLite database, or use any other database but configure it to use a large amount of memory. For a drop-in replacement, you could use a Map implementation which is backed by the database.
rob
It's simply a pain, that many entries in a `Map`. It also suggests that you aren't using a numbered sequence as the key (as in decent data models and in a `List`), and that is really a performance pain.
BalusC
Apart from embedded RDBMS suggestions such as HSQLDB/SQLite, you could also take advantage of the JDK's built-in JavaDB.
BalusC
This doesn't answer the question.
OscarRyz
To paraphrase oxbow_lakes: "[your favorite] DB is nowhere near fast enough for this number of entries due to the serialization/IO overhead; it could never be faster than a hashmap and the OP doesn't care about persistence. Your suggestion is not a good one."
ckeh
@ckeh: only if you have enough RAM, which is in practice unlikely with 26 million entries.
BalusC
"only if you have enough RAM, which is in practice unlikely with 26 million entries." - Perhaps... but that is speculation. Note that the OP didn't encounter an issue with RAM in practice, as per the answer: http://stackoverflow.com/questions/1757363/java-hashmap-performance-optimization-alternative/1761232#1761232
ckeh
It would be quite interesting if the OP posted information about RAM usage, etc, since a number of answers assumed that is too much data for an in memory solution while the OP's results indicate that was not the case.
ckeh
+3  A: 

HashMap has an initial capacity, and HashMap's performance depends very heavily on the hashCode produced by the underlying objects.

Try to tweak both.

Mykola Golubyev
+1  A: 

Are you specifying the proper capacity and load factor in the map constructor? Did you implement the hashCode method of the objects you are inserting?

GregB
Yes, I implemented the `hashCode()` method and it satisfies the requirement that equal objects have the same hashcode.
Nash0
That's a necessary but not sufficient condition for a good hashCode. You also want different inputs to produce different outputs as often as possible.
Jay
+5  A: 

My first idea is to make sure you're initializing your HashMap appropriately. From the JavaDocs for HashMap:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

So if you're starting off with a too-small HashMap, then every time it needs to resize, all the hashes are recomputed... which might be what you're feeling when you get to the 2-3 million insertion point.
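
As a rough illustration of that sizing, here is a minimal sketch, assuming the 26 million figure from the question and the default load factor of 0.75 (the helper name is hypothetical):

import java.util.HashMap;
import java.util.Map;

public final class MapSizing {
    // Pre-size the table so it never rehashes while inserting expectedEntries
    // entries: the capacity must be at least expectedEntries / loadFactor.
    static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
        return new HashMap<>(initialCapacity, loadFactor);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = presized(26_000_000, 0.75f);
        System.out.println("map created; no rehashing needed while filling it");
    }
}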

delfuego
I don't think they are recomputed, ever. The table size is increased, the hashes are kept.
Henning
HashMap just does a bit-wise AND for every entry: newIndex = storedHash & (newCapacity - 1).
Henning
Henning: Perhaps poor wording on delfuego's part, but the point is valid. Yes, the hash values are not recomputed in the sense that the output of hashCode() is not recalculated. But when the table size is increased, all the keys must be re-inserted into the table, that is, the hash value must be re-hashed to get a new slot number in the table.
Jay
Jay, yep -- poor wording indeed, and what you said. :)
delfuego
Jay: But that's a relatively cheap operation, and as the size doubles every time, this can under no circumstances explain poor performance.
Henning
The strange thing is that setting the initial capacity equal to the number of elements I want to insert actually *decreases* the performance.
Nash0
@Henning: Well, the AND function itself is cheap, but you do have to loop through all the slots in the table, then chase the linked list on every slot, and then for each entry do the AND (plus a "length-1" -- seems to me they should have cached the subtraction, but whatever). None of that is expensive of itself, but doing it for millions of entries could add up. I haven't done any benchmarks, but I'd think it's worth looking at.
Jay
@Jay: I'm not against properly initializing the capacity, quite to the contrary. But if you start with the default capacity (16), you will only have to increase about 20 times for 26 million entries. That's a few million AND and copy operations total, sure, but could that explain a _significant_ drop in performance to the point where it's noticeable, let alone unbearable? The bad hashCode() scenario you describe in your answer sounds much better.
Henning
@delfuego and @nash0: Yeap, setting the initial capacity equal to the number of elements decreases the performance because you're having tens of millions of collisions and thus you're only using a small amount of that capacity. *Even* if you use all the available entries, setting the same capacity will make it worse, because due to the load factor more space will be requested. You'll have to use `initialCapacity = maxEntries/loadFactor` ( such as 30M, 0.95 for 26M entries ) but this is **NOT** your case, since you're having all those collisions you're using only about 20k entries or less.
OscarRyz
Good point Oscar, it is the hash method that needs to be fixed.
Nash0
A: 

Allocate a large map in the beginning. If you know it will have 26 million entries and you have the memory for it, do a `new HashMap(30000000)`.

Are you sure you have enough memory for 26 million entries with 26 million keys and values? This sounds like a lot of memory to me. Are you sure that the garbage collection is still doing fine at your 2 to 3 million mark? I could imagine that as a bottleneck.

ReneS
Oh, another thing. Your hash codes have to be evenly distributed to avoid large linked lists at single positions in the map.
ReneS
A: 
OscarRyz
Oscar, as pointed out elsewhere above (in response to your comments), you seem to be assuming that more collisions is GOOD; it's very much NOT good. A collision means that the slot at a given hash goes from containing a single entry to containing a list of entries, and this list has to be searched/traversed every time the slot is accessed.
delfuego
@delfuego: Not really, that happens only when you have a collision using different classes but for the same class the same entry is used ;)
OscarRyz
@delfuego: See for yourself: http://pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries.
OscarRyz
Actually I am adding 26 million items; the number of items is (52 choose 5) * (5 choose 3) =~ 26 million. I am sorry, but my loop is not doing much besides calling `put()`; it is the HashMap performance.
Nash0
I think I see what you mean by "not storing 26 million objects": I am storing 26 million, but they are only going into 20k hash buckets. From my experimentation that is true.
Nash0
@Nash0: MMhh I see, yeah you should know your code much better. I just can tell about my findings. I think you should definitely use a profiler and let it tell you where the problem is. I still think it is not the hash map, but of course I don't have your code. Here's the code for my tests if you find it useful. http://pastebin.com/f7a6cd5a5 Using your hashcode for 26M with no initial capacity runs in 8s
OscarRyz
That's the number I got also, 20K buckets; they are not really being stored... wait a minute, are you using something other than RAM? Because if you're serializing somehow, that's where the performance is going.
OscarRyz
@Oscar - see my response to you with MAK's answer. HashMap maintains a linked list of entries at each hash bucket, and walks that list calling equals() on every element. The object's class has nothing to do with it (other than a short-circuit on equals()).
kdgregory
@Oscar - Reading your answer it seems that you are assuming that equals() will return true if the hashcodes are the same. This is not part of the equals/hashcode contract. If I've misunderstood, ignore this comment.
kdgregory
Thank you very much for the effort Oscar, but I think you are confusing the key objects being equal vs having the same hash code. Also, in one of your code links you are using equal strings as the key; remember that strings in Java are immutable. I think we both learned a lot about hashing today :)
Nash0
@Nash0: OOoook, yeap, we learned a lot :) Now I got it, so you're returning the same hash for objects that are NOT equals, right? Yeap I miss that point. The javadoc says: *It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.*http://java.sun.com/javase/6/docs/api/java/lang/Object.html#hashCode()
OscarRyz
+6  A: 

To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is found, and then mapped to an entry in the table. If two hash values map to the same entry -- a "hash collision" -- HashMap builds a linked list.

Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.

So if you're seeing performance problems, the first thing I'd check is: Am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". Like, suppose you were working with strings, and you took the length of the string for the hash value. (Not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, this would be a very bad hash function.

Edit: I should add: Every time you add a new entry, HashMap checks if this is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, etc. By the time you get to key #1 million, you've done nearly half a trillion compares.

@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, that this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: Whenever a key hashes to the same slot, HashMap must check if it's a duplicate of an existing key, or if they are just in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap: I would say that the "whole point" is that you can retrieve elements by key quickly.

But anyway, that doesn't affect the "whole point" that I was trying to make: When you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check each new key to see if it is in fact a duplicate of an existing key, each attempt to add a new entry that maps to this same slot must chase the linked list examining each existing entry to see if this is a duplicate of a previously-seen key, or if it is a new key.
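
A quick way to check that distribution is to count how many distinct hash codes a sample of keys actually produces; the sketch below is illustrative only and nothing in it comes from the answer above.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public final class HashSpread {
    // Count the distinct hashCode() values a set of keys produces and report
    // the worst pile-up; a tiny distinct count relative to the number of keys
    // means the hash function, not the HashMap, is the bottleneck.
    static void report(Collection<?> keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Object key : keys) {
            counts.merge(key.hashCode(), 1, Integer::sum);
        }
        int worst = counts.values().stream().max(Integer::compare).orElse(0);
        System.out.println(keys.size() + " keys -> " + counts.size()
                + " distinct hash codes, worst slot holds " + worst + " keys");
    }
}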

Jay
Yup, a bad hash function would result in this kind of behavior. +1
Henning
Not really. The list is created **only** if the hash is the same, but the key is **different**. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because `String.equals( Integer )` is `false`. **But** if you have the same class ( or at least `.equals` returns true ) then the same entry is used. For instance `new String("one")` and `new String("one") used as keys, will use the same entry. Actually this is the **WHOLE** point of HashMap in first place! See for yourself: http://pastebin.com/f20af40b9
OscarRyz
@Oscar: See my reply appended to my original post.
Jay
+2  A: 

You can try to use an in-memory database like HSQLDB.

Adrian
+1  A: 

Have you considered using an embedded database to do this? Look at Berkeley DB. It is open source and is now owned by Oracle.

It stores everything as key->value pairs; it is NOT an RDBMS, and it aims to be fast.

coolest_head
Berkeley DB is nowhere near fast enough for this number of entries due to the serialization/IO overhead; it could never be faster than a hashmap and the OP doesn't care about persistence. Your suggestion is not a good one.
oxbow_lakes
+1  A: 

SQLite lets you use it in memory.

JRL
+1  A: 

First you should check that you are using the Map correctly: a good hashCode() method for the keys, a suitable initial capacity for the Map, the right Map implementation, and so on, as many other answers describe.

Then I would suggest using a profiler to see what is actually happening and where the execution time is spent. Is, for example, the hashCode() method executed billions of times?

If that doesn't help, how about using something like EHCache or memcached? Yes, they are products for caching but you could configure them so that they will have enough capacity and will never evict any values from cache storage.

Another option would be some database engine that is lighter weight than full SQL RDBMS. Something like Berkeley DB, maybe.

Note that I personally have no experience with these products' performance, but they could be worth a try.

Juha Syrjälä
+3  A: 

If the keys have any pattern to them then you can split the map into smaller maps and have an index map.

Example: keys 1, 2, 3, ..., n split across 28 maps of 1 million each. Index map: 1-1,000,000 -> Map1; 1,000,000-2,000,000 -> Map2; and so on.

So you'll be doing two lookups, but the key set would be 1,000,000 vs 28,000,000. You can easily do this with string patterns also.

If the keys are completely random then this will not work.
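
A minimal sketch of this range-based index-map idea follows; the partition count, range size, and class name are illustrative assumptions rather than anything prescribed by the answer (for random keys, a comment below suggests routing by key.hashCode() % 28 instead).

import java.util.HashMap;
import java.util.Map;

public final class PartitionedMap<V> {
    // 28 sub-maps; a numeric key is routed to a partition by its range.
    // Both the partition count and the range size are example values.
    private static final int PARTITIONS = 28;
    private static final long RANGE = 1_000_000L;

    @SuppressWarnings("unchecked")
    private final Map<Long, V>[] maps = new HashMap[PARTITIONS];

    public PartitionedMap() {
        for (int i = 0; i < PARTITIONS; i++) {
            maps[i] = new HashMap<>();
        }
    }

    // Assumes non-negative keys roughly in the range 0 .. 28 million.
    private Map<Long, V> partitionFor(long key) {
        int index = (int) Math.min(key / RANGE, PARTITIONS - 1);
        return maps[index];
    }

    public void put(long key, V value) {
        partitionFor(key).put(key, value);
    }

    public V get(long key) {
        return partitionFor(key).get(key);
    }
}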

coolest_head
Even if the keys are random, you could use (key.hashCode() % 28) to select the map in which to store that key-value pair.
Juha Syrjälä
Interesting, it could work.
Nash0
A: 

You could try caching the computed hash code in the key object.

Something like this:

private Integer hashCode; // cached value; null until computed the first time

public int hashCode() {
  if(this.hashCode == null) {
     this.hashCode = computeHashCode();
  }
  return this.hashCode;
}

private int computeHashCode() {
   int hash = 503;
   hash = hash * 5381 + (a[0] + a[1]);
   hash = hash * 5381 + (b[0] + b[1] + b[2]);
   return hash;
}

Of course you have to be careful not to change contents of the key after the hashCode has been calculated for the first time.

Edit: It seems that caching hash code values is not worthwhile when you are adding each key only once to a map. In some other situations this could be useful.

Juha Syrjälä
As is pointed out below, there's no recomputation of the hashcodes of objects in a HashMap when it's resized, so this doesn't gain you anything.
delfuego
And, the performance gain is negligible
OscarRyz
+5  A: 

I'd suggest a three-pronged approach:

  1. Run Java with more memory: java -Xmx256M for example to run with 256 Megabytes. Use more if needed and you have lots of RAM.

  2. Cache your calculated hash values as suggested by another poster, so each object only calculates its hash value once.

  3. Use a better hashing algorithm. The one you posted would return the same hash where a = {0, 1} as it would where a = {1, 0}, all else being equal.

Utilise what Java gives you for free.

public int hashCode() {
    return 31 * Arrays.hashCode(a) + Arrays.hashCode(b);
}

I'm pretty sure this has much less chance of clashing than your existing hashCode method, although it depends on the exact nature of your data.

Steve McLeod
RAM might be way too small for this kind of map and these arrays, so I already suspected a memory limitation problem.
ReneS
+5  A: 

If the arrays in your posted hashCode are bytes, then you will likely end up with lots of duplicates.

a[0] + a[1] will always be between 0 and 512. Adding the b's will always result in a number between 0 and 768. Multiply those and you get an upper limit of 400,000 unique combinations, assuming your data is perfectly distributed among every possible value of each byte. If your data is at all regular, you likely have far fewer unique outputs from this method.

Peter Recore
+6  A: 

One thing I notice in your hashCode method is that the order of the elements in the arrays a[] and b[] doesn't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). Actually all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)==sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:

hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c3*b[2]);

where c0, c1 and c3 are distinct constants (you can use different constants for b if necessary). That should even things out a bit more.
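
A fuller version of this suggestion might look like the sketch below; the class name, method name, and constant values are arbitrary examples, not part of the answer.

public final class WeightedHash {
    // Distinct multipliers so that the position of each element affects the
    // hash; the specific constant values here are arbitrary examples.
    private static final int C0 = 31;
    private static final int C1 = 131;
    private static final int C2 = 1313;

    static int hash(byte[] a, byte[] b) {
        int hash = 503;
        hash = hash * 5381 + (C0 * a[0] + C1 * a[1]);
        hash = hash * 5381 + (C0 * b[0] + C1 * b[1] + C2 * b[2]);
        return hash;
    }
}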

MAK
That is a good suggestion.
Nash0
Although I should also add that it won't work for me because I want the property that arrays with the same elements in different orders give the same hashcode.
Nash0
In that case, you have 52C2 + 52C3 hashcodes (23426 according to my calculator), and a hashmap is very much the wrong tool for the job.
kdgregory
Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading.
OscarRyz
@Oscar - more collisions equals *more* work to do, because now you have to do a linear search of the hash chain. If you have 26,000,000 distinct values per equals(), and 26,000 distinct values per hashCode(), then the bucket chains will have 1,000 objects each.
kdgregory
@Nash0: You appear to be saying that you want these to have the same hashCode but at the same time not being equal (as defined by the equals() method). Why would you want that?
MAK
@kdgregory: Yes, but only if the collision happens with **different** classes, for the same class ( which is the case ) the same entry is used.
OscarRyz
@Oscar - I'm not sure I follow you: how does class matter? A HashMap works by picking a bucket chain by dividing the hashcode value by the number of buckets in the map (the "table" variable in Sun's implementation). Each bucket holds a linked list of entries, and the map walks that list calling equals() on every node.
kdgregory
@Nash0: To follow up on my last comment, if you cannot change the hashCode method, I think using a TreeMap would give you better performance.
MAK
@kdgregory: I get 512*768 = 391680 possible hashes. Am I smoking dope?
Trevor Harrison
+1  A: 

Another poster already pointed out that your hashcode implementation will result in a lot of collisions due to the way that you're adding values together. I'm willing to bet that, if you look at the HashMap object in a debugger, you'll find that you have maybe 200 distinct hash values, with extremely long bucket chains.

If you always have values in the range 0..51, each of those values will take 6 bits to represent. If you always have 5 values, you can create a 30-bit hashcode with left-shifts and additions:

    int code = a[0];
    code = (code << 6) + a[1];
    code = (code << 6) + b[0];
    code = (code << 6) + b[1];
    code = (code << 6) + b[2];
    return code;

The left-shift is fast, but will leave you with hashcodes that aren't evenly distributed (because 6 bits implies a range 0..63). An alternative is to multiply the hash by 51 and add each value. This still won't be perfectly distributed (eg, {2,0} and {1,51} will collide), and will be slower than the shift.

    int code = a[0];
    code = code * 51 + a[1];
    code = code * 51 + b[0];
    code = code * 51 + b[1];
    code = code * 51 + b[2];
    return code;
kdgregory
@kdgregory: I've answered about the "more collisions implies more work" somewhere else :)
OscarRyz
+4  A: 

If the two byte arrays you mention are your entire key, the values are in the range 0-51 and unique, and the order within the a and b arrays is insignificant, my math tells me that there are only about 26 million possible combinations and that you are likely trying to fill the map with values for all possible keys.

In this case, both filling and retrieving values from your data store would of course be much faster if you use an array instead of a HashMap and index it from 0 to 25989599.

jarnbjo
That is a very good idea, and in fact I am doing that for another data storage issue with 1.2 billion elements. In this case I wanted to take the easy way out and use a premade data structure :)
Nash0
+10  A: 

As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed to get a unique hash code for every object:

public int hashCode() {       
    Arrays.sort(a);
    Arrays.sort(b);       
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}

The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second, measured over blocks of 100,000 puts from 100,000 to 2,000,000, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much much better. The old method tailed off very quickly while the new one keeps up a good throughput.

Nash0
I suggest not modifying the arrays in the `hashCode` method. By convention, `hashCode` does not change the object's state. Perhaps the constructor would be a better place to sort them.
Michael Myers
I agree that the sorting of the arrays should happen in the constructor. The code shown never seems to set the hashCode. Calculating the code can be done more simply as follows: `int result = a[0]; result = result * 52 + a[1]; //etc`.
rsp
I agree that sorting in the constructor and then calculating the hash code as mmyers and rsp suggest is better. In my case my solution is acceptable and I wanted to highlight the fact that the arrays must be sorted for `hashCode()` to work.
Nash0
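
Putting the two suggestions above together, here is a hedged sketch of what the key might look like with sorting moved to the constructor and the base-52 code computed iteratively; the class and field names are hypothetical, and equals() is omitted for brevity.

import java.util.Arrays;

public final class SortedKey {
    private final byte[] a; // length 2, values 0-51
    private final byte[] b; // length 3, values 0-51
    private final int hash; // computed once, after sorting

    public SortedKey(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a); // sort in the constructor, not in hashCode()
        Arrays.sort(this.b);
        int h = this.a[0];
        h = h * 52 + this.a[1]; // treat the five sorted values as base-52 digits
        h = h * 52 + this.b[0];
        h = h * 52 + this.b[1];
        h = h * 52 + this.b[2];
        this.hash = h;
    }

    @Override
    public int hashCode() {
        return hash; // equals(), comparing the sorted arrays, is omitted here
    }
}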
+3  A: 

Getting into the gray area of "on/off topic", but necessary to eliminate confusion regarding Oscar Reyes' suggestion that more hash collisions are a good thing because they reduce the number of elements in the HashMap. I may misunderstand what Oscar is saying, but I don't seem to be the only one: kdgregory, delfuego, Nash0, and I all seem to share the same (mis)understanding.

If I understand what Oscar is saying about the same class with the same hashcode, he's proposing that only one instance of a class with a given hashcode will be inserted into the HashMap. Eg, if I have an instance of SomeClass with a hashcode of 1 and a second instance of SomeClass with a hashcode of 1, only one instance of SomeClass is inserted.

The Java pastebin example at http://pastebin.com/f20af40b9 seems to indicate the above correctly summarizes what Oscar is proposing.

Regardless of any understanding or misunderstanding, what actually happens is that different instances of the same class are not collapsed into a single HashMap entry just because they have the same hashcode; that only happens once the keys are also determined to be equal. The hashcode contract requires that equal objects have the same hashcode; however, it doesn't require that unequal objects have different hashcodes (although this may be desirable for other reasons)[1].

The pastebin.com/f20af40b9 example (which Oscar refers to at least twice) follows, but modified slightly to use JUnit assertions rather than println statements. This example is used to support the proposal that identical hashcodes cause collisions and that when the classes are the same only one entry is created (eg, only one String in this specific case):

@Test
public void shouldOverwriteWhenEqualAndHashcodeSame() {
    String s = new String("ese");
    String ese = new String("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // AND equal
    assertTrue(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(2, map.size());

    assertEquals(2, map.get("ese"));
    assertEquals(3, map.get(some));

    assertTrue(s.equals(ese) && s.equals("ese"));
}

class SomeClass {
    public int hashCode() {
        return 100727;
    }
}

However, the hashcode isn't the complete story. What the pastebin example neglects is the fact that both s and ese are equal: they are both the string "ese". Thus, inserting or getting the contents of the map using s or ese or "ese" as the key are all equivalent because s.equals(ese) && s.equals("ese").

A second test demonstrates it is erroneous to conclude that identical hashcodes on the same class is the reason the key -> value s -> 1 is overwritten by ese -> 2 when map.put(ese, 2) is called in test one. In test two, s and ese still have the same hashcode (as verified by assertEquals(s.hashCode(), ese.hashCode());) AND they are the same class. However, s and ese are MyString instances in this test, not Java String instances, with the only difference relevant for this test being the equals: String s equals String ese in test one above, whereas MyString s does not equal MyString ese in test two:

@Test
public void shouldInsertWhenNotEqualAndHashcodeSame() {
    MyString s = new MyString("ese");
    MyString ese = new MyString("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // BUT not equal
    assertFalse(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still  same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(3, map.size());

    assertEquals(1, map.get(s));
    assertEquals(2, map.get(ese));
    assertEquals(3, map.get(some));
}

/**
 * NOTE: equals is not overridden so the default implementation is used
 * which means objects are only equal if they're the same instance, whereas
 * the actual Java String class compares the value of its contents.
 */
class MyString {
    String i;

    MyString(String i) {
        this.i = i;
    }

    @Override
    public int hashCode() {
        return 100727;
    }
}

Based on a later comment, Oscar seems to reverse what he's said earlier and acknowledges the importance of equals. However, it still seems the notion that equals is what matters, not the "same class", is unclear (emphasis mine):

"Not really. The list is created only if the hash is the same, but the key is different. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because String.equals( Integer ) is false. But if you have the same class ( or at least .equals returns true ) then the same entry is used. For instance new String("one") and `new String("one") used as keys, will use the same entry. Actually this is the WHOLE point of HashMap in first place! See for yourself: pastebin.com/f20af40b9 – Oscar Reyes"

versus earlier comments that explicitly address the importance of identical class and same hashcode, with no mention of equals:

"@delfuego: See for yourself: pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries. – Oscar Reyes"

or

"Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading. – Oscar Reyes"

or

"@kdgregory: Yes, but only if the collision happens with different classes, for the same class ( which is the case ) the same entry is used. – Oscar Reyes"

Again, I may misunderstand what Oscar was actually trying to say. However, his original comments have caused enough confusion that it seems prudent to clear everything up with some explicit tests so there are no lingering doubts.


[1] - From Effective Java, Second Edition by Joshua Bloch:

  • Whenever it is invoked on the same object more than once during an execution of an application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

ckeh
+2  A: 

I'm late here, but a couple comments about big maps:

  1. As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
  2. However, a potentially hidden issue here is GC impact of giant maps.

I'm making an assumption that these maps are long-lived, i.e. you populate them and they stick around for the duration of the app. I'm also assuming the app itself is long-lived -- like a server of some sort.

Each entry in a Java HashMap requires three objects: the key, the value, and the Entry that ties them together. So 26M entries in the map means 26M * 3 == 78M objects. This is fine until you hit a full GC. Then you've got a pause-the-world problem. The GC will look at each of the 78M objects and determine they're all alive. 78M+ objects is just a lot of objects to look at. If your app can tolerate occasional long (perhaps many seconds) pauses, there's no issue. If you're trying to achieve any latency guarantees you could have a major issue (of course if you want latency guarantees, Java's not the platform to choose :)) If the values in your maps churn quickly you can end up with frequent full collects which compounds the problem greatly.

I don't know of a great solution to this issue. Ideas:

  • It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
  • If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collects
  • You could create your own map impl and do explicit memory management on byte[] (i.e. trade cpu for more predictable latency by serializing millions of objects into a single byte[] -- ugh!)
  • Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
  • Hope that the new G1 collector will help (mainly applies to the high-churn case)

Just some thoughts from someone who has spent a lot of time with giant maps in Java.


overthink
+1  A: 

As pointed out, your hashcode implementation has too many collisions, and fixing it should result in decent performance. Moreover, caching hashCodes and implementing equals efficiently will help.

If you need to optimize even further:

By your description, there are only (52 * 51 / 2) * (52 * 51 * 50 / 6) = 29304600 different keys (of which 26000000, i.e. about 90%, will be present). Therefore, you can design a hash function without any collisions, and use a simple array rather than a hashmap to hold your data, reducing memory consumption and increasing lookup speed:

T[] array = (T[]) new Object[Key.maxHashCode]; // generic array creation needs a cast

void put(Key k, T value) {
    array[k.hashCode()] = value;
}

T get(Key k) {
    return array[k.hashCode()];
}

(Generally, it is impossible to design an efficient, collision-free hash function that clusters well, which is why a HashMap will tolerate collisions, which incurs some overhead)

Assuming a and b are sorted, you might use the following hash function:

public int hashCode() {
    assert a[0] < a[1]; 
    int ahash = a[1] * a[1] / 2 
              + a[0];

    assert b[0] < b[1] && b[1] < b[2];

    int bhash = b[2] * b[2] * b[2] / 6
              + b[1] * b[1] / 2
              + b[0];
    return bhash * 52 * 52 / 2 + ahash;
}

static final int maxHashCode = 52 * 52 / 2 * 52 * 52 * 52 / 6;

I think this is collision-free. Proving this is left as an exercise for the mathematically inclined reader.

meriton
A: 

Maybe try using Commons Collections' FastHashMap if you need it to be synchronized:

http://commons.apache.org/collections/api/org/apache/commons/collections/FastHashMap.html

01
A: 

I did a small test a while back comparing a list with a hashmap; the funny thing was that iterating through the list and finding the object took the same amount of time in milliseconds as using the hashmap's get function... just an FYI. Oh yeah, memory is a big issue when working with hashmaps that size.

Grep