I need a disk-backed Map structure to use in a Java app. It must meet the following criteria:

  1. Capable of storing millions of records (even billions)
  2. Fast lookup - the majority of operations on the Map will simply be to see if a key already exists. This and point 1 are the most important criteria. There should be an effective in-memory caching mechanism for frequently used keys.
  3. Persistent, but not transactional; can live with some failure, i.e. happy to sync with disk periodically.
  4. Capable of storing simple primitive types - but I don't need to store serialised objects.
  5. It does not need to be distributed, i.e. will run all on one machine.
  6. Simple to set up & free to use.
  7. No SQL-type queries required

Record keys will be strings or longs. As described above, reads will be much more frequent than writes, and the majority of reads will simply check whether a key exists (i.e. will not need to read the key's associated data). Each record will be updated once only and records are not deleted.
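The access pattern described above can be sketched as a small interface; this is a hypothetical illustration of the contract, not the API of any particular library:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the contract described above: a disk-backed
// map whose hot path is a key-existence check that should not need to
// read the value.
interface DiskBackedMap<K, V> {
    boolean containsKey(K key); // hot path: the majority of operations
    V get(K key);               // rare: value reads are uncommon
    void put(K key, V value);   // each record is written exactly once
    void sync();                // periodic flush; some loss on crash is OK

    // Trivial in-memory reference implementation, only to make the
    // contract concrete; a real implementation would page to disk.
    static <K, V> DiskBackedMap<K, V> inMemory() {
        Map<K, V> m = new HashMap<>();
        return new DiskBackedMap<K, V>() {
            public boolean containsKey(K key) { return m.containsKey(key); }
            public V get(K key) { return m.get(key); }
            public void put(K key, V value) { m.put(key, value); }
            public void sync() { /* nothing to flush in memory */ }
        };
    }
}
```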

I'm currently using Bdb JE but am looking for other options and am interested in hearing what others either recommend or advise against using.


Update

I have since improved the query performance on the existing BDB setup by reducing the dependency on secondary keys. Some of our queries required a join on two secondary keys. By reducing the number of secondary keys joined on (e.g. by combining them into a composite key) we have removed a level of indirection in the lookup, which speeds things up nicely.
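The composite-key trick above can be sketched in plain Java; the separator choice and field values here are assumptions for illustration only:

```java
// Hypothetical sketch of combining two secondary-key fields into one
// composite key, so a lookup needs a single index probe instead of a
// join on two secondary indexes.
final class CompositeKey {
    // Use a separator that cannot appear in either field; NUL is a
    // common choice for string keys.
    private static final char SEP = '\u0000';

    static String compose(String first, String second) {
        return first + SEP + second;
    }
}
```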

+1  A: 

I'd likely use a local database, such as BDB JE or HSQLDB. May I ask what is wrong with this approach? You must have some reason to be looking for alternatives.

In response to comments: As the problem is performance, and I guess you are already using JDBC to handle this, it might be worth trying HSQLDB and reading the chapter on Memory and Disk Use.

mlk
+1 agree. I would use a regular DB and write a nice API for the requirements so that the backend can be switched easily.
flybywire
Once BDB reaches the limits of what can be cached in memory I'm finding that it slows down unacceptably. This generally happens after about 1 million inserts.
Joel
How about HSQLDB? I'm going to guess they both use JDBC, so you should be able to slot it in without modifying much of your existing code. Would be worth reading: http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#deployment_mem_disk-sect
mlk
BDBs slow down once you hit the point where you're thrashing your cache. BDBs essentially keep a BTree in memory that tries to answer a request. If the request cannot be answered, the BDB pages in more data from disk. Once your working set is larger than your cache, you'll find trouble. There are JMX methods for monitoring cache hits/misses and cache size: use them to debug your application and, if necessary, increase the heap and give BDB more cache.
jasonmp85
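The monitoring advice above can be illustrated in plain Java. BDB JE exposes its real statistics through its own API and JMX; the class below is only a sketch of the hit/miss-counting idea, using a small LRU map as a stand-in for the cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Sketch of tracking cache hits and misses around a backing lookup,
// so you can watch the hit rate fall as the working set outgrows the
// cache. Not the BDB API itself.
final class MonitoredCache<K, V> {
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();
    private final Map<K, V> lru;
    private final Function<K, V> backingLoad; // e.g. a disk read

    MonitoredCache(int capacity, Function<K, V> backingLoad) {
        this.backingLoad = backingLoad;
        // access-order LinkedHashMap gives simple LRU eviction
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<K, V> e) {
                return size() > capacity;
            }
        };
    }

    synchronized V get(K key) {
        V v = lru.get(key);
        if (v != null) { hits.incrementAndGet(); return v; }
        misses.incrementAndGet();
        v = backingLoad.apply(key);
        if (v != null) lru.put(key, v);
        return v;
    }

    double hitRate() {
        long h = hits.get(), m = misses.get();
        return (h + m) == 0 ? 1.0 : (double) h / (h + m);
    }
}
```

If the hit rate stays low at steady state, the cache is smaller than the working set and more memory (or fewer keys) is the fix.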
Also, HSQLDB is **not** an acceptable solution. While it can store a lot of data on disk, it does **not** stream that data from disk when doing reads. It reads the entire `ResultSet` into memory rather than paging it in as you iterate through it. If you ever need to walk over a large portion of a table, this will blow out your memory. BDBs handle this just fine. I also believe the H2 database (http://www.h2database.com/html/main.html) claims to solve this, though I've never used it.
jasonmp85
@jasonmp85 - this is exactly what I've found - once the BDB BTree no longer fits in memory you're in trouble.
Joel
A: 

I think Hibernate Shards may easily fulfill all your requirements.

Boris Pavlović
A: 

memcached provides an excellent scalable map-based distributed cache. If you use this and back it with one of the databases mentioned to provide persistence, you may well solve your performance problems, at least for frequently hit keys (as long as you provide enough RAM to cache all the values that are frequently accessed).

Bill Michell
+1  A: 

SQLite does this. I wrote a wrapper for using it from Java: http://zentus.com/sqlitejdbc

As I mention in a comment, I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast.

The only pain is the JDBC interface. Compared to a simple HashMap, it is clunky. I often end up writing a JDBC-wrapper for the specific project, which can add up to a lot of boilerplate code.
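A sketch of what such a wrapper might look like, assuming a hypothetical table `kv` with columns `k` and `v`; the hot path issues an existence query that never reads the value column:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical JDBC boilerplate of the kind described above: a thin
// wrapper exposing map-style calls. Table and column names are
// assumptions for illustration.
final class SqliteMap {
    private final Connection conn;

    SqliteMap(Connection conn) { this.conn = conn; }

    // SQL for the hot path: existence check without reading the value.
    static String existsSql() {
        return "SELECT 1 FROM kv WHERE k = ? LIMIT 1";
    }

    boolean containsKey(String key) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(existsSql())) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    void put(String key, long value) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO kv (k, v) VALUES (?, ?)")) {
            ps.setString(1, key);
            ps.setLong(2, value);
            ps.executeUpdate();
        }
    }
}
```

With a unique index on `k`, the existence query is a single index probe.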

David Crawshaw
I seriously doubt SQLite would scale to this many records.
Omry
I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast.
David Crawshaw
A: 

JBoss (tree) Cache is a great option. You can use it standalone from JBoss. Very robust, performant, and flexible.

james
Is it persistent?
Seun Osewa
+1  A: 

You may want to look into OrientDB.

Juha Syrjälä
A: 

I've so far found Tokyo Cabinet to be the simplest persistent Hash/Map to integrate into my code.

This abbreviated example, taken from the docs, shows how simple it is to save and retrieve data from a persistent Hash:

    import tokyocabinet.HDB;

    // create the object
    HDB hdb = new HDB();
    // open (and create, if needed) the database file
    if (!hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT)) {
        int ecode = hdb.ecode();
        System.err.println("open error: " + hdb.errmsg(ecode));
    }
    // store an item
    hdb.put("foo", "hop");
    // read it back
    String value = hdb.get("foo");
    hdb.close();
Joel