views:

257

answers:

10

I need to store up to tens or even hundreds of millions of pieces of data on-disk. Each piece of data contains information like:

id=23425
browser=firefox
ip-address=10.1.1.1
outcome=1.0

New pieces of data may be added at the rate of up-to 1 per millisecond.

So its a relatively simple set of key-value pairs, where the values can be strings, integers, or floats. Occasionally I may need to update the piece of data with a particular id, changing the flag field from 0 to 1. In other words, I need to be able to do random key lookups by id, and modify the data (actually only the floating point "outcome" field - so I'll never need to modify the size of the value).

The other requirement is that I need to be able to stream this data off disk (the order isn't particularly important) efficiently. This means that the hard disk head should not need to jump around the disk to read the data, rather it should be read in consecutive disk blocks.

I'm writing this in Java.

I've thought about using an embedded database, but DB4O is not an option as it is GPL and the rest of my code is not. I also worry about the efficiency of using an embedded SQL database, given the overhead of translating to and from SQL queries.

Does anyone have any ideas? Might I have to build a custom solution to this (where I'm dealing directly with ByteBuffers, and handling the id lookup)?

A: 

I think you'd have a lot more success writing something that caches the most active records in memory and queues data changes as a low priority insert into the DB.

I understand there's a slight increase in IO using this method but if you're talking about millions of records I think it would still be faster because any search algorithm you create is going to be greatly outperformed by a a full fledged database engine.

Spencer Ruport
A: 

You could try Berkley DB which is now owned by Oracle. They have Open Source and Commercial licenses. It uses a Key/Value model (with an option to create indexes if other forms of queries are required). There is a pure Java version and a native version with Java bindings.

Michael Barker
I hope I can find something free, unfortunately Berkeley DB is not unless I'm willing to GPL my code, which isn't an option.
sanity
A: 

One trick you could use is to improve throughput with an RDBMS is to batch your database insert / updates. For instance, I system I worked on did inserts/updates in batches of up to 3,000. The down side is that the SQL needs to be be more complex and needs to be tuned.

Stephen C
OK down-voter, why is this a bad idea? I can assure you that it did work.
Stephen C
A: 

http://www.zentus.com/sqlitejdbc/

SQLite database (public domain), JDBC connector with BSD license, native for a whole bunch of platforms (OSX, Linux, Windows), emulation for the rest.

Marian
+2  A: 

How about H2? The License should work for you.

  • You can use H2 for free. You can integrate it into your application (including commercial applications), and you can distribute it.
  • Files containing only your code are not covered by this license (it is 'commercial friendly').
  • Modifications to the H2 source code must be published.
  • You don't need to provide the source code of H2 if you did not modify anything.

I get

1000000 insert in 22492ms (44460.252534234394 row/sec)

100000 updates in 9565ms (10454.783063251438 row/sec)

from

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;


/**
 * @author clint
 *
 */
public class H2Test {

  static int testrounds = 1000000;

  public static void main(String[] args) {
    try {
      Class.forName("org.h2.Driver");

    Connection conn = DriverManager.
        getConnection("jdbc:h2:/tmp/test.h2", "sa", "");
    // add application code here
    conn.createStatement().execute("DROP TABLE IF EXISTS TEST");
    conn.createStatement().execute("CREATE TABLE IF NOT EXISTS TEST(id INT PRIMARY KEY, browser VARCHAR(64),ip varchar(16), outcome real)"); 
    //conn.createStatement().execute("CREATE INDEX IDXall ON TEST(id,browser,ip,outcome");


    PreparedStatement ps = conn.prepareStatement("insert into TEST (id, browser, ip, outcome) values (?,?,?,?)");
    long time = System.currentTimeMillis();
    for ( int i = 0; i < testrounds; i++ ) {
      ps.setInt(1,i);
      ps.setString(2,"firefox");
      ps.setString(3,"000.000.000.000");
      ps.setFloat(4,0);
      ps.execute();
    }
    long last = System.currentTimeMillis() ;
    System.out.println( testrounds + " insert in " + (last - time) + "ms (" + ((testrounds)/((last - time)/1000d)) + " row/sec)" );

    ps.close();
    ps = conn.prepareStatement("update TEST set outcome = 1 where id=?");
    Random random = new Random();
    time = System.currentTimeMillis();

    /// randomly updadte 10% of the entries
    for ( int i = 0; i < testrounds/10; i++ ) {
      ps.setInt(1,random.nextInt(testrounds));
      ps.execute();
    }

    last = System.currentTimeMillis();
    System.out.println( (testrounds/10) + " updates in " + (last - time) + "ms (" + ((testrounds/10)/((last - time)/1000d)) + " row/sec)" );

    conn.close();

    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    } catch (SQLException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }

}
Clint
A: 

I'd also take a look to see if there's anything existing based on either EHCache or JCS that might help.

Gwyn Evans
A: 

You can use Apache Derby (or JavaDB) which is bundled with JDK. However, if a DBMS doesn't provide the required speed you may implement a specific file structure yourself. If just exact key lookup is required, you may use a hash-file to implement it. The hash-file is the fastest file structure for such requirements (much faster than general purpose file structures such as B-Trees and grids which are used in DBs). It also provides acceptable streaming efficiency.

+1  A: 

JDBM is a great embedded database for Java (and not as encumbered with licensing as the Java version of Berkley). It would be worth trying. If you don't need ACID guarantees (i.e. you are OK with the database getting corrupted in the event of a crash), turn off the transaction manager (significantly increases speed).

Kevin Day
A: 

In the end I decided to log the data to disk as it comes in, and also keep it in memory where I can update it. After a period of time I write the data out to disk and delete the log.

sanity
A: 

Have you taken a look at Oracle's 'TimesTen' database? Its an in-memory db that is supposed to be very high-performance. Don't know about costs/licenses, etc, but take a look at Oracles site and search for it. Eval download should be available.

andora