views: 497

answers: 3

I am currently working on a programming problem where I am attempting to build a massive hashmap of data. The key is a custom low-memory implementation of a CharSequence that implements hashCode() and equals(...), and the value is an Integer object.

There may be millions of entries in this hashmap. I managed to drastically reduce memory use for the value by having the Integer be a pointer into a file containing the data I wish to hash. The problem is that the keys may be tens of bytes each (25 bytes on average), and that the keys need to be held in memory in the default implementation of HashMap.

I need a hashmap with a low memory overhead that can page the keys to disk, or alternatively store only a hashed representation of the keys. If the keys were themselves hashed, though, I would be concerned about hash collisions.

Ideally, I would like to be able to store a million entries in the map per 50MB of heap space (a 25-byte array as the key and an Integer object as the value).

Does anyone have any experience with low-memory filesystem-backed Maps that are optimised for reducing the footprint of the keys?

Thanks,

Chris

+2  A: 

How about Berkeley DB Java Edition? Its StoredMap class looks like what you are looking for.
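
As a rough illustration, here is a minimal sketch of how a disk-backed map might look with the JE Collections API, assuming a recent Berkeley DB JE version; the environment directory and database name are arbitrary, and error handling is omitted:

    import java.io.File;

    import com.sleepycat.bind.ByteArrayBinding;
    import com.sleepycat.bind.tuple.IntegerBinding;
    import com.sleepycat.collections.StoredMap;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class StoredMapExample {
        public static void main(String[] args) throws Exception {
            // Open (or create) a JE environment in an arbitrary directory.
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("keymap-env"), envConfig);

            // Open (or create) the database that holds the entries on disk.
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "keys", dbConfig);

            // Expose the database as a java.util.Map: ~25-byte keys, Integer values.
            StoredMap<byte[], Integer> map = new StoredMap<byte[], Integer>(
                    db, new ByteArrayBinding(), new IntegerBinding(), true);

            map.put(new byte[] {1, 2, 3}, 42);
            System.out.println(map.get(new byte[] {1, 2, 3}));

            db.close();
            env.close();
        }
    }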

Kai Chan
+1  A: 

You could use Java's HashMap and write a FileKey class that takes a RandomAccessFile, an offset and a length, precomputes the hash at construction, and implements equals() by reading the data from the file only when an actual comparison is needed.
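
A minimal sketch of what such a FileKey might look like; the class and field names are my own, and error handling is kept deliberately simple:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Arrays;

    public final class FileKey {
        private final RandomAccessFile file; // backing file that holds the key bytes
        private final long offset;           // where this key's bytes start
        private final int length;            // how many bytes the key occupies
        private final int hash;              // precomputed at construction

        public FileKey(RandomAccessFile file, long offset, int length) throws IOException {
            this.file = file;
            this.offset = offset;
            this.length = length;
            this.hash = Arrays.hashCode(readBytes()); // hash computed once, bytes not retained
        }

        private byte[] readBytes() throws IOException {
            byte[] buf = new byte[length];
            synchronized (file) {
                file.seek(offset);
                file.readFully(buf);
            }
            return buf;
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FileKey)) return false;
            FileKey other = (FileKey) o;
            if (hash != other.hash || length != other.length) return false;
            try {
                // Only touch the file when the hashes match and a real comparison is required.
                return Arrays.equals(readBytes(), other.readBytes());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }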

In conjunction with a simple MRU cache, you could keep some number of recently used keys fully in memory, using another hashmap keyed on the same FileKey objects but comparing only the offset and length values (not the file data).
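
For the MRU cache itself, one common option (a sketch, not part of the answer's own code) is LinkedHashMap in access-order mode, evicting the eldest entry once a size limit is reached; MAX_ENTRIES here is an arbitrary tuning knob:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class KeyCache<K, V> extends LinkedHashMap<K, V> {
        private static final int MAX_ENTRIES = 10_000; // assumed limit, tune to taste

        public KeyCache() {
            super(16, 0.75f, true); // true = access order, so the eldest entry is the least recently used
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // Evict the least recently used entry once the cache grows past the limit.
            return size() > MAX_ENTRIES;
        }
    }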

Software Monkey
+1  A: 

I think that the default HashSet is not a bad way to go: make the key-value pair yourself (so you don't have to wrap it in an extra entry object). It is pretty memory-efficient that way; it really only requires about (1/loadFactor)^(3/2) * 4 bytes of extra memory on top of your key object, plus 4 bytes for the value. In practice, this should add something like 8 bytes of overhead per entry. (You can reduce this further if you know in advance how many keys you're going to store.)
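
A minimal sketch of the combined key-value object that could go into such a HashSet; the class name and fields are hypothetical, and equality is based only on the key bytes so the value simply rides along:

    import java.util.Arrays;

    final class Entry {
        final byte[] key;   // the ~25-byte key bytes
        int value;          // file offset stored as a plain int, no Integer boxing

        Entry(byte[] key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(key); // hash only the key bytes
        }

        @Override
        public boolean equals(Object o) {
            // Two entries are equal if their key bytes match; the value is ignored.
            return o instanceof Entry && Arrays.equals(key, ((Entry) o).key);
        }
    }

One practical wrinkle: HashSet has no get() that returns the stored element, so retrieving the value typically means keeping the entries in a HashMap that maps each entry to itself.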

Rex Kerr