views: 399

answers: 2

I need to store a big hash set able to contain up to approximately 200 million 40-bit values. Storing them as 200 million 64-bit values would be acceptable (despite the 200 million * 24 bits of waste).

The requirements are:

  • tiny memory footprint (disk space ain't an issue, memory is)

  • fast contains(long l) and add(long l) methods (much faster than SQL); see the interface sketch after this list

  • embedded

  • free and without nasty licensing (so no Berkeley DB); LGPL is fine

  • no false positives and no false negatives, so things like disk-based Bloom filters are not what I'm after
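
For concreteness, this is roughly the contract I have in mind (DiskLongSet is just a placeholder name, not an existing library):

    // Hypothetical interface, purely to illustrate the contract I'm after.
    public interface DiskLongSet extends AutoCloseable {
        boolean contains(long value); // no false positives, no false negatives
        void add(long value);         // must stay fast at ~200 million entries
    }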

SQL is not what I'm after here.

I really think I'm after something fast like the approach described here (note how much faster it is than a SQL solution):

http://stackoverflow.com/questions/495161/fast-disk-based-hashtables

Does Google have such a Java API?

Would a fast disk-based key/value pair implementation where I'd only use the 'key' work?

Or something else?

I'd rather not reinvent the wheel.

+2  A: 

If you can afford 128 GB of disk, you could store one bit per possible 40-bit value: 2^40 bits is 2^37 bytes, i.e. 128 GiB. You could then use a random access file to check whether a bit is set or to set it. You wouldn't have to insert any values or maintain an index.
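
A minimal sketch of that idea (the FileBitSet class and the file name are placeholders of mine, not from the answer):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /** One bit per possible 40-bit value, kept in a 2^37-byte file. */
    public class FileBitSet implements AutoCloseable {
        private final RandomAccessFile file;

        public FileBitSet(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
            // 2^40 bits = 2^37 bytes = 128 GiB; on typical filesystems this
            // creates a sparse file whose unwritten regions read back as zeros.
            file.setLength(1L << 37);
        }

        public boolean contains(long value) throws IOException {
            file.seek(value >>> 3);                  // byte holding this value's bit
            return (file.read() & (1 << (int) (value & 7))) != 0;
        }

        public void add(long value) throws IOException {
            file.seek(value >>> 3);
            int b = file.read();                     // read-modify-write one byte
            file.seek(value >>> 3);
            file.write(b | (1 << (int) (value & 7)));
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

Each contains() costs one seek plus a one-byte read; add() is a read-modify-write of a single byte.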

Peter Lawrey
@Peter Lawrey: interesting... Incidentally I had asked another question earlier, for something else, about the fastest way in Java to do random lookups in big files :) Sadly it is not feasible in my case: this has to be installed on several computers. A few gigabytes are OK, but 128 is too much :( I like the approach that said...
cocotwo
A: 

I believe you will need to use a B-tree rather than a hash table. Hash tables do not have good locality for secondary storage, so you will lose too much time to disk I/O.

Most databases -- relational or not -- implement their indexes as B-trees, so you are talking about the equivalent of storing an index with no other data attached to it.
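
As a concrete illustration of that key-only index idea, here is a minimal sketch using MapDB, an Apache-licensed embedded B-tree store (the builder calls are for MapDB 3.x and may differ in other versions; the Boolean value is just a placeholder, since only the key matters):

    import org.mapdb.BTreeMap;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    public class KeyOnlyBTree {
        public static void main(String[] args) {
            // File-backed store: the B-tree lives on disk, not in the Java heap.
            DB db = DBMaker.fileDB("ids.db").make();

            // Key-only usage of a B-tree map; the Boolean value is a throwaway.
            BTreeMap<Long, Boolean> ids = db
                    .treeMap("ids", Serializer.LONG, Serializer.BOOLEAN)
                    .createOrOpen();

            long value = 0xAB_CDEF_0123L;               // any 40-bit value fits in a long
            ids.put(value, Boolean.TRUE);               // add(long)
            System.out.println(ids.containsKey(value)); // contains(long)

            db.close();
        }
    }

Because the tree keeps nearby keys in the same disk page, a lookup touches only O(log n) pages rather than one random page per probe, which is the locality argument above.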

Will you have multiple processes concurrently updating this data store?

Steven Huwig