Or should I use a different hammer to fix this problem?

I've got a very simple use-case for storing data, effectively a sparse matrix, which I've attempted to store in a SQLite database. I've created a table:

create TABLE data ( id1 INTEGER KEY, timet INTEGER KEY, value REAL )

into which I insert a lot of data: roughly 800 elements every 10 minutes, 45 times a day, most days of the year. The tuple of (id1,timet) will always be unique.

The timet value is seconds since the epoch and will always be increasing. The id1 is, for all practical purposes, a random integer, though there are probably only about 20000 unique ids.

I'd then like to access all values where id1==someid, or all values where timet==sometime. In my tests using the latest SQLite via the C interface on Linux, a lookup for one of these (or any variant of such a lookup) takes approximately 30 seconds, which is not fast enough for my use case.
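For concreteness, the first kind of lookup via the C interface looks roughly like this (a sketch only; error checking omitted, with dbhandle and someid as in the code further down):

sqlite3_stmt *query = NULL;
// Fetch every (timet, value) pair for a given id1.
sqlite3_prepare_v2( dbhandle, "select timet, value from data where id1 = ?;", -1, &query, NULL );
sqlite3_bind_int( query, 1, someid );
while ( sqlite3_step( query ) == SQLITE_ROW )
{
  const int    timet = sqlite3_column_int( query, 0 );
  const double value = sqlite3_column_double( query, 1 );
  // ... use timet/value ...
}
sqlite3_finalize( query );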

I tried defining an index for the database, but this slowed down insertion to completely unworkable speeds (I might have done this incorrectly though...)

The table above leads to very slow access for any query. My questions are:

  • Is SQLite completely the wrong tool for this?
  • Can I define indices to speed things up significantly?
  • Should I be using something like HDF5 instead of SQL for this?

Please excuse my very basic understanding of SQL!

Thanks

I include a code sample that shows how the insertion speed slows to a crawl when using indices. With the 'create index' statements in place, the code takes 19 minutes to complete. Without them, it runs in 18 seconds.


#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sqlite3.h>

// Abort with a message if a sqlite3 call does not return the expected code.
void checkdbres( int res, int expected, const std::string &msg )
{
  if (res != expected) { std::cerr << msg << std::endl; exit(1); }
}

int main(int argc, char **argv)
{
  const size_t nRecords = 800*45*30;

  sqlite3      *dbhandle = NULL;
  sqlite3_stmt *pStmt = NULL;
  char statement[512];

  checkdbres( sqlite3_open("/tmp/junk.db", &dbhandle ), SQLITE_OK, "Failed to open db");

  checkdbres( sqlite3_prepare_v2( dbhandle, "create table if not exists data ( issueid INTEGER KEY, time INTEGER KEY, value REAL);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index issueidindex on data (issueid );", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index timeindex on data (time);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

  for ( size_t idx=0; idx < nRecords; ++idx)
  {
    if (idx%800==0)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "BEGIN TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to begin transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute begin transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize begin transaction");
      std::cout << "idx " << idx << " of " << nRecords << std::endl;
    }

    const size_t time = idx/800;
    const size_t issueid = idx % 800;
    const float value = static_cast<float>(rand()) / RAND_MAX;
    sprintf( statement, "insert into data values (%d,%d,%f);", (int)issueid, (int)time, value );
    checkdbres( sqlite3_prepare_v2( dbhandle, statement, -1, &pStmt, NULL ), SQLITE_OK, "Failed to build statement");
    checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
    checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

    if (idx%800==799)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "END TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to end transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute end transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize end transaction");
    }
  }

  checkdbres( sqlite3_close( dbhandle ), SQLITE_OK, "Failed to close db" ); 
}

+8  A: 

Are you inserting all of the 800 elements at once? If you are, doing the inserts within a transaction will speed up the process dramatically.

See http://www.sqlite.org/faq.html#q19

SQLite can handle very large databases. See http://www.sqlite.org/limits.html

Robert Harvey
I'm already doing each set of 45*800 inserts as a single transaction to speed up the inserts. This makes a big difference when building the DB, giving me an insert of 45*800 elements in about 300ms. My issue is still with the querying, though.
Brian O'Kennedy
Are you using the needed index? The index will bring the read time down to O(log n) from O(n). How to best retrieve the data will depend on how much data you are retrieving at once. Are you retrieving a whole set, or just single records individually?
Robert Harvey
I tried the index by doing something like: CREATE INDEX INDEX1 on data ( id1 ); and CREATE INDEX INDEX2 on data ( timet ); However, this meant that when I built the db up, each subsequent set of insert statements took longer and longer, and the process finally crawled to a complete halt. Are these create index statements correct/sensible?
Brian O'Kennedy
They are if you have to query on both id and time. I'll have to give this some thought.
Robert Harvey
Would it be practical to put all 800 id values into a single record, indexed by time? How sparse is the data, i.e. are you always writing all 800 values?
Robert Harvey
The figures are all roughly representative of what I'd like to do. I write approximately 800 elements every 10 minutes, and it's not always the same ~800 elements either. Perhaps grouping each set of these into its own record/table is the correct thing to do, but this is where my SQL experience fails me! ;)
Brian O'Kennedy
+2  A: 

I can't tell from your specs, but if the ID field is always increasing, and the time field includes YYYYMMDD for uniqueness and is also always increasing, and you're doing either ID searches or time searches, then the simplest non-database solution would be to append all records to a fixed-field text or binary file (since they're being generated in "sorted" order) and use code to do a binary search for the desired records (e.g., find the first record with the ID or time of interest, then step sequentially through the desired range).
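For illustration, a minimal sketch of that idea, assuming fixed-size binary records sorted by time (the Record layout and function name below are hypothetical, not part of the answer):

#include <cstdio>
#include <cstdint>

// Hypothetical fixed-size record layout, appended to the file in time order.
struct Record {
  int32_t id1;
  int64_t timet;
  double  value;
};

// Binary-search a file of Records (sorted by timet) for the index of the
// first record with timet >= target; the caller then reads forward
// sequentially until the time of interest is passed.
long findFirstAtOrAfter(FILE *f, int64_t target)
{
  fseek(f, 0, SEEK_END);
  long nRecords = ftell(f) / (long)sizeof(Record);
  long lo = 0, hi = nRecords;            // half-open search range [lo, hi)
  while (lo < hi) {
    long mid = lo + (hi - lo) / 2;
    Record r;
    fseek(f, mid * (long)sizeof(Record), SEEK_SET);
    if (fread(&r, sizeof(Record), 1, f) != 1) break;
    if (r.timet < target) lo = mid + 1; else hi = mid;
  }
  return lo;  // == nRecords if no record is at or after target
}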

joe snyder
The time value is a time_t, always increasing as I add values, with each chunk of 800 values being at the same time. The id1 is an (almost) random int. I was planning something like this using hdf5 for the binary file storage, but hoped sqlite might help. Especially since sqlite has transactional writes to the file, as I will have multiple readers accessing the file with occasional writes. As an added complexity, the file is kept on NFS storage (but I have control over the linux kernels accessing the storage, and proper nfs locking is supported). http://www.sqlite.org/faq.html#q5
Brian O'Kennedy
+7  A: 

I've looked at your code, and I think you might be overdoing it with the prepare and finalize statements. I am by no means an SQLite expert, but there's got to be significant overhead in preparing a statement each and every time through the loop.

Quoting from the SQLite website:

After a prepared statement has been evaluated by one or more calls to sqlite3_step(), it can be reset in order to be evaluated again by a call to sqlite3_reset(). Using sqlite3_reset() on an existing prepared statement rather than creating a new prepared statement avoids unnecessary calls to sqlite3_prepare(). In many SQL statements, the time needed to run sqlite3_prepare() equals or exceeds the time needed by sqlite3_step(). So avoiding calls to sqlite3_prepare() can result in a significant performance improvement.

http://www.sqlite.org/cintro.html

In your case, rather than preparing a new statement each time, you could try binding new values to your existing statement.
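As a rough sketch of what that change could look like for the insert loop in your code (reusing your checkdbres helper and variables; untested):

// Prepare the parameterised insert once, outside the loop.
checkdbres( sqlite3_prepare_v2( dbhandle, "insert into data values (?,?,?);", -1, &pStmt, NULL ), SQLITE_OK, "Failed to build insert statement");

for ( size_t idx = 0; idx < nRecords; ++idx )
{
  // ... BEGIN TRANSACTION handling as before ...

  // Bind this iteration's values to the already-prepared statement.
  checkdbres( sqlite3_bind_int( pStmt, 1, (int)(idx % 800) ), SQLITE_OK, "Failed to bind issueid");
  checkdbres( sqlite3_bind_int( pStmt, 2, (int)(idx / 800) ), SQLITE_OK, "Failed to bind time");
  checkdbres( sqlite3_bind_double( pStmt, 3, static_cast<double>(rand()) / RAND_MAX ), SQLITE_OK, "Failed to bind value");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement");
  // Reset so the same statement can be stepped again next time round;
  // the bindings are simply overwritten on the next iteration.
  checkdbres( sqlite3_reset( pStmt ), SQLITE_OK, "Failed to reset insert statement");

  // ... END TRANSACTION handling as before ...
}

checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert statement");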

All this said, I think the indexes might be the actual culprit, since the time keeps increasing as you add more data. I am curious enough about this that I plan to do some testing over the weekend.

Robert Harvey
I can move the "BEGIN TRANSACTION" and "END TRANSACTION" prepare/finalize to outside the loop, but surely the main insert statement has to be re-done every time, since the parameters are different every time?Anyway, without the index tables being present, the entire operation flies along - the addition of those somehow slows it all down.Thanks for pointing out the overuse though.
Brian O'Kennedy
Ah, I see you just updated your comment! thanks for the help, much appreciated.
Brian O'Kennedy
+1 Definitely build the query once, and use binding.
maxwellb
I've modified my code to build the query once and bind for subsequent inserts. Runtime down from 19 seconds to 15 seconds for the version without an index, so that certainly helps. However, the index version is still slow.
Brian O'Kennedy
I have one more idea, which I will post shortly.
Robert Harvey
+2  A: 

Consider using a separate, unindexed table for the current day's new inserts. Then, at the end of each day, run a script which will:

  1. Insert new values from new_table into master_table
  2. Clear the new_table for next day of processing

If you can do lookups on historical data in O(log n), and lookups on today's data in O(n), this should provide a nice compromise.

maxwellb
That's a good idea, provided there's some downtime available to do this.
Robert Harvey
That is certainly something I can do, consolidating at the end of every day - I'll have a think about implementing something like that.
Brian O'Kennedy
@Robert, there should be some downtime. The original question states that he's inserting rows every 10 minutes, 45 times a day. Meaning there should be over 16 hours of downtime for the script to run.
T Pops
That's why I suggested this. Precisely because of the 10 minutes x 45.
maxwellb
I also use the term "script" loosely. This could of course, be a compiled program.
maxwellb
+3  A: 

Since we know that capturing your data is fast when there is no index on the table, what might actually work is this:

  1. Capture the 800 values in a temporary table with no index.

  2. Copy the records to the master table (containing indexes) using the form of INSERT INTO that takes a SELECT statement.

  3. Delete the records from the temporary table.

This technique is based on the theory that the INSERT INTO that takes a SELECT statement is faster than executing individual INSERTs.

Step 2 can be executed in the background by using the Asynchronous Module, if it still proves to be a bit slow. This takes advantage of the bits of downtime between captures.
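A minimal sketch of steps 1-3, using sqlite3_exec and the question's checkdbres helper (the 'capture' table name is just an illustration):

// 1. Capture each batch of 800 values into an unindexed temporary table.
checkdbres( sqlite3_exec( dbhandle, "CREATE TEMP TABLE IF NOT EXISTS capture ( issueid INTEGER, time INTEGER, value REAL );", NULL, NULL, NULL ), SQLITE_OK, "Failed to create capture table");
// ... fast, unindexed inserts into 'capture' go here, inside a transaction ...

// 2. Copy the whole batch into the indexed master table in a single statement.
checkdbres( sqlite3_exec( dbhandle, "INSERT INTO data SELECT issueid, time, value FROM capture;", NULL, NULL, NULL ), SQLITE_OK, "Failed to copy batch into data");

// 3. Empty the temporary table, ready for the next capture.
checkdbres( sqlite3_exec( dbhandle, "DELETE FROM capture;", NULL, NULL, NULL ), SQLITE_OK, "Failed to clear capture table");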

Robert Harvey
Interesting idea that's worth a shot. I'll be away from my dev pc for the weekend, so will have a go on Monday and get back to you. Thanks.
Brian O'Kennedy
+1  A: 

Answering my own question just as a place to put some details:

It turns out (as correctly suggested above) that the index creation is the slow step, and every time I do another transaction of inserts, the index is updated, which takes some time. My solution is to: (A) create the data table, (B) insert all my historical data (several years' worth), and (C) only then create the indexes.

Now all lookups etc. are really fast and sqlite does a great job. Subsequent updates during the day now take a few seconds to insert just 800 records, but that's no problem since they only run every 10 minutes or so.
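In outline, the build now looks like this (a sketch using sqlite3_exec; the index names are just placeholders, not the exact production code):

// (A) create the table with no indexes
checkdbres( sqlite3_exec( dbhandle, "CREATE TABLE IF NOT EXISTS data ( id1 INTEGER, timet INTEGER, value REAL );", NULL, NULL, NULL ), SQLITE_OK, "Failed to create table");

// (B) bulk-insert all the historical data here, batched in transactions,
//     using a single prepared "insert into data values (?,?,?)" statement

// (C) only now build the indexes, once, over the full data set
checkdbres( sqlite3_exec( dbhandle, "CREATE INDEX id1index ON data (id1);", NULL, NULL, NULL ), SQLITE_OK, "Failed to create id1 index");
checkdbres( sqlite3_exec( dbhandle, "CREATE INDEX timetindex ON data (timet);", NULL, NULL, NULL ), SQLITE_OK, "Failed to create timet index");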

Thanks to Robert Harvey and maxwellb for the help/suggestions/answers above.

Brian O'Kennedy