views: 399

answers: 2

I need to store a big hash set able to contain up to approximately 200 million 40-bit values. Storing them as 200 million 64-bit values would be acceptable (despite the 200 million * 24 bits of waste).

The requirements are:

  • tiny memory footprint (disk space ain't an issue, memory is)

  • fast contains(long l) and add(long l) methods (much faster than SQL); see the interface sketch after this list

  • embedded

  • free and without nasty licensing (so no Berkeley DB); LGPL is fine

  • no false positives and no false negatives, so things like disk-based Bloom filters are not what I'm after
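
For concreteness, this is roughly the contract I have in mind (DiskLongSet is just a placeholder name, not an existing library):

    // Hypothetical interface, purely to illustrate the contract I'm after.
    public interface DiskLongSet extends AutoCloseable {
        boolean contains(long value); // no false positives, no false negatives
        void add(long value);         // must stay fast at ~200 million entries
    }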

SQL is not what I'm after here.

I really think I'm after something fast like the approach described here (note how much faster it is than a SQL solution):

http://stackoverflow.com/questions/495161/fast-disk-based-hashtables

Does Google have such a Java API?

Would a fast disk-based key/value pair implementation where I'd only use the 'key' work?

Or something else?

I'd rather not reinvent the wheel.

+2  A: 

If you can afford 128 GB of disk, you could store one bit per possible 40-bit value: 2^40 bits is 2^37 bytes, i.e. 128 GiB. You could then use a random access file to check whether a bit is set or to set it. You wouldn't have to insert any values or maintain an index.
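
A minimal sketch of that idea (the FileBitSet class and the file name are placeholders of mine, not from the answer):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /** One bit per possible 40-bit value, kept in a 2^37-byte file. */
    public class FileBitSet implements AutoCloseable {
        private final RandomAccessFile file;

        public FileBitSet(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
            // 2^40 bits = 2^37 bytes = 128 GiB; on typical filesystems this
            // creates a sparse file whose unwritten regions read back as zeros.
            file.setLength(1L << 37);
        }

        public boolean contains(long value) throws IOException {
            file.seek(value >>> 3);                  // byte holding this value's bit
            return (file.read() & (1 << (int) (value & 7))) != 0;
        }

        public void add(long value) throws IOException {
            file.seek(value >>> 3);
            int b = file.read();                     // read-modify-write one byte
            file.seek(value >>> 3);
            file.write(b | (1 << (int) (value & 7)));
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

Each contains() costs one seek plus a one-byte read; add() is a read-modify-write of a single byte.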

Peter Lawrey
@Peter Lawrey: interesting... Incidentally I had asked another question earlier, for something else, about the fastest way in Java to do random lookups in big files :) Sadly it is not feasible in my case: this has to be installed on several computers. A few gigabytes are OK, but 128 is too much :( I like the approach that said...
cocotwo
A: 

I believe you will need to use a B-tree rather than a hash table. Hash tables do not have good locality for secondary storage, so you will lose too much time to disk I/O.

Most databases -- relational or not -- implement their indexes as B-trees, so you are talking about the equivalent of storing an index with no other data attached to it.
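
As a concrete illustration of that key-only index idea, here is a minimal sketch using MapDB, an Apache-licensed embedded B-tree store (the builder calls are for MapDB 3.x and may differ in other versions; the Boolean value is just a placeholder, since only the key matters):

    import org.mapdb.BTreeMap;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    public class KeyOnlyBTree {
        public static void main(String[] args) {
            // File-backed store: the B-tree lives on disk, not in the Java heap.
            DB db = DBMaker.fileDB("ids.db").make();

            // Key-only usage of a B-tree map; the Boolean value is a throwaway.
            BTreeMap<Long, Boolean> ids = db
                    .treeMap("ids", Serializer.LONG, Serializer.BOOLEAN)
                    .createOrOpen();

            long value = 0xAB_CDEF_0123L;               // any 40-bit value fits in a long
            ids.put(value, Boolean.TRUE);               // add(long)
            System.out.println(ids.containsKey(value)); // contains(long)

            db.close();
        }
    }

Because the tree keeps nearby keys in the same disk page, a lookup touches only O(log n) pages rather than one random page per probe, which is the locality argument above.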

Will you have multiple processes concurrently updating this data store?

Steven Huwig