Or should I use a different hammer to fix this problem?

I've got a very simple use-case for storing data, effectively a sparse matrix, which I've attempted to store in a SQLite database. I've created a table:

create TABLE data ( id1 INTEGER KEY, timet INTEGER KEY, value REAL )

into which I insert a lot of data: roughly 800 elements every 10 minutes, 45 times a day, most days of the year. The tuple of (id1,timet) will always be unique.

The timet value is seconds since the epoch and will always be increasing. The id1 is, for all practical purposes, a random integer, though there are probably only about 20000 unique ids.

I'd then like to access all values where id1==someid, or all values where timet==sometime. In my tests using the latest SQLite via the C interface on Linux, a lookup for one of these (or any variant of such a lookup) takes approximately 30 seconds, which is not fast enough for my use case.
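For concreteness, the first kind of lookup via the C interface looks roughly like this (a sketch only; error checking omitted, with dbhandle and someid as in the code further down):

sqlite3_stmt *query = NULL;
// Fetch every (timet, value) pair for a given id1.
sqlite3_prepare_v2( dbhandle, "select timet, value from data where id1 = ?;", -1, &query, NULL );
sqlite3_bind_int( query, 1, someid );
while ( sqlite3_step( query ) == SQLITE_ROW )
{
  const int    timet = sqlite3_column_int( query, 0 );
  const double value = sqlite3_column_double( query, 1 );
  // ... use timet/value ...
}
sqlite3_finalize( query );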

I tried defining an index for the database, but this slowed down insertion to completely unworkable speeds (I might have done this incorrectly though...)

The table above leads to very slow access for any query. My questions are:

  • Is SQLite completely the wrong tool for this?
  • Can I define indices to speed things up significantly?
  • Should I be using something like HDF5 instead of SQL for this?

Please excuse my very basic understanding of SQL!

Thanks

I include a code sample that shows how the insertion speed slows to a crawl when using indices. With the 'create index' statements in place, the code takes 19 minutes to complete. Without them, it runs in 18 seconds.


#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sqlite3.h>

// Abort with a message if a sqlite3 call does not return the expected code.
void checkdbres( int res, int expected, const std::string &msg )
{
  if (res != expected) { std::cerr << msg << std::endl; exit(1); }
}

int main(int argc, char **argv)
{
  const size_t nRecords = 800*45*30;

  sqlite3      *dbhandle = NULL;
  sqlite3_stmt *pStmt = NULL;
  char statement[512];

  checkdbres( sqlite3_open("/tmp/junk.db", &dbhandle ), SQLITE_OK, "Failed to open db");

  checkdbres( sqlite3_prepare_v2( dbhandle, "create table if not exists data ( issueid INTEGER KEY, time INTEGER KEY, value REAL);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index issueidindex on data (issueid );", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index timeindex on data (time);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

  for ( size_t idx=0; idx < nRecords; ++idx)
  {
    if (idx%800==0)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "BEGIN TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to begin transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute begin transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize begin transaction");
      std::cout << "idx " << idx << " of " << nRecords << std::endl;
    }

    const size_t time = idx/800;
    const size_t issueid = idx % 800;
    const float value = static_cast<float>(rand()) / RAND_MAX;
    sprintf( statement, "insert into data values (%d,%d,%f);", (int)issueid, (int)time, value );
    checkdbres( sqlite3_prepare_v2( dbhandle, statement, -1, &pStmt, NULL ), SQLITE_OK, "Failed to build statement");
    checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
    checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

    if (idx%800==799)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "END TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to end transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute end transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize end transaction");
    }
  }

  checkdbres( sqlite3_close( dbhandle ), SQLITE_OK, "Failed to close db" ); 
}

+8  A: 

Are you inserting all of the 800 elements at once? If you are, doing the inserts within a transaction will speed up the process dramatically.

See http://www.sqlite.org/faq.html#q19

SQLite can handle very large databases. See http://www.sqlite.org/limits.html

Robert Harvey
I'm already doing each set of 45*800 inserts as a single transaction to speed up the inserts. This makes a big difference when building the DB, giving me an insert of 45*800 elements in about 300ms. My issue is still with the querying, though.
Brian O'Kennedy
Are you using the needed index? The index will bring the read time down to O(log n) from O(n). How to best retrieve the data will depend on how much data you are retrieving at once. Are you retrieving a whole set, or just single records individually?
Robert Harvey
I tried the index by doing something like: CREATE INDEX INDEX1 on data ( id1 ); and CREATE INDEX INDEX2 on data ( timet ); However, this meant that when I built the db up, each subsequent set of insert statements took longer and longer, and the process finally crawled to a complete halt. Are these create index statements correct/sensible?
Brian O'Kennedy
They are if you have to query on both id and time. I'll have to give this some thought.
Robert Harvey
Would it be practical to put all 800 id values into a single record, indexed by time? How sparse is the data, i.e. are you always writing all 800 values?
Robert Harvey
The figures are all roughly representative of what I'd like to do. I write approximately 800 elements every 10 minutes, and it's not always the same ~800 elements either. Perhaps grouping each set of these into its own record/table is the correct thing to do, but this is where my SQL experience fails me! ;)
Brian O'Kennedy
+2  A: 

I can't tell from your specs, but if the ID field is always increasing, and the time field includes YYYYMMDD for uniqueness and is also always increasing, and you're doing either ID searches or time searches, then the simplest non-database solution would be to append all records to a fixed-field text or binary file (since they're being generated in "sorted" order) and use code to do a binary search for the desired records (e.g., find the first record with the ID or time of interest, then step sequentially through the desired range).
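For illustration, a minimal sketch of that idea, assuming fixed-size binary records sorted by time (the Record layout and function name below are hypothetical, not part of the answer):

#include <cstdio>
#include <cstdint>

// Hypothetical fixed-size record layout, appended to the file in time order.
struct Record {
  int32_t id1;
  int64_t timet;
  double  value;
};

// Binary-search a file of Records (sorted by timet) for the index of the
// first record with timet >= target; the caller then reads forward
// sequentially until the time of interest is passed.
long findFirstAtOrAfter(FILE *f, int64_t target)
{
  fseek(f, 0, SEEK_END);
  long nRecords = ftell(f) / (long)sizeof(Record);
  long lo = 0, hi = nRecords;            // half-open search range [lo, hi)
  while (lo < hi) {
    long mid = lo + (hi - lo) / 2;
    Record r;
    fseek(f, mid * (long)sizeof(Record), SEEK_SET);
    if (fread(&r, sizeof(Record), 1, f) != 1) break;
    if (r.timet < target) lo = mid + 1; else hi = mid;
  }
  return lo;  // == nRecords if no record is at or after target
}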

joe snyder
The time value is a time_t, always increasing as I add values, with each chunk of 800 values being at the same time. The id1 is an (almost) random int. I was planning something like this using hdf5 for the binary file storage, but hoped sqlite might help. Especially since sqlite has transactional writes to the file, as I will have multiple readers accessing the file with occasional writes. As an added complexity, the file is kept on NFS storage (but I have control over the linux kernels accessing the storage, and proper nfs locking is supported). http://www.sqlite.org/faq.html#q5
Brian O'Kennedy
+7  A: 

I've looked at your code, and I think you might be overdoing it with the prepare and finalize statements. I am by no means an SQLite expert, but there's got to be significant overhead in preparing a statement each and every time through the loop.

Quoting from the SQLite website:

After a prepared statement has been evaluated by one or more calls to sqlite3_step(), it can be reset in order to be evaluated again by a call to sqlite3_reset(). Using sqlite3_reset() on an existing prepared statement rather than creating a new prepared statement avoids unnecessary calls to sqlite3_prepare(). In many SQL statements, the time needed to run sqlite3_prepare() equals or exceeds the time needed by sqlite3_step(). So avoiding calls to sqlite3_prepare() can result in a significant performance improvement.

http://www.sqlite.org/cintro.html

In your case, rather than preparing a new statement each time, you could try binding new values to your existing statement.
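As a rough sketch of what that change could look like for the insert loop in your code (reusing your checkdbres helper and variables; untested):

// Prepare the parameterised insert once, outside the loop.
checkdbres( sqlite3_prepare_v2( dbhandle, "insert into data values (?,?,?);", -1, &pStmt, NULL ), SQLITE_OK, "Failed to build insert statement");

for ( size_t idx = 0; idx < nRecords; ++idx )
{
  // ... BEGIN TRANSACTION handling as before ...

  // Bind this iteration's values to the already-prepared statement.
  checkdbres( sqlite3_bind_int( pStmt, 1, (int)(idx % 800) ), SQLITE_OK, "Failed to bind issueid");
  checkdbres( sqlite3_bind_int( pStmt, 2, (int)(idx / 800) ), SQLITE_OK, "Failed to bind time");
  checkdbres( sqlite3_bind_double( pStmt, 3, static_cast<double>(rand()) / RAND_MAX ), SQLITE_OK, "Failed to bind value");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement");
  // Reset so the same statement can be stepped again next time round;
  // the bindings are simply overwritten on the next iteration.
  checkdbres( sqlite3_reset( pStmt ), SQLITE_OK, "Failed to reset insert statement");

  // ... END TRANSACTION handling as before ...
}

checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert statement");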

All this said, I think the indexes might be the actual culprit, since the time keeps increasing as you add more data. I am curious enough about this that I plan to do some testing over the weekend.

Robert Harvey
I can move the "BEGIN TRANSACTION" and "END TRANSACTION" prepare/finalize to outside the loop, but surely the main insert statement has to be re-done every time, since the parameters are different every time?Anyway, without the index tables being present, the entire operation flies along - the addition of those somehow slows it all down.Thanks for pointing out the overuse though.
Brian O'Kennedy
Ah, I see you just updated your comment! thanks for the help, much appreciated.
Brian O'Kennedy
+1 Definitely build the query once, and use binding.
maxwellb
I've modified my code to build the query once and bind for subsequent inserts. Runtime down from 19 seconds to 15 seconds for the version without an index, so that certainly helps. However, the index version is still slow.
Brian O'Kennedy
I have one more idea, which I will post shortly.
Robert Harvey
+2  A: 

Consider using a separate, unindexed table for the current day's new inserts. Then, at the end of each day, run a script which will:

  1. Insert new values from new_table into master_table
  2. Clear the new_table for next day of processing

If you can do lookups on historical data in O(log n), and lookups on today's data in O(n), this should provide a nice compromise.

maxwellb
That's a good idea, provided there's some downtime available to do this.
Robert Harvey
That is certainly something I can do, consolidating at the end of every day - I'll have a think about implementing something like that.
Brian O'Kennedy
@Robert, there should be some downtime. The original question states that he's inserting rows every 10 minutes, 45 times a day. Meaning there should be over 16 hours of downtime for the script to run.
T Pops
That's why I suggested this. Precisely because of the 10 minutes x 45.
maxwellb
I also use the term "script" loosely. This could of course, be a compiled program.
maxwellb
+3  A: 

Since we know that capturing your data is fast when there is no index on the table, what might actually work is this:

  1. Capture the 800 values in a temporary table with no index.

  2. Copy the records to the master table (containing indexes) using the form of INSERT INTO that takes a SELECT statement.

  3. Delete the records from the temporary table.

This technique is based on the theory that the INSERT INTO that takes a SELECT statement is faster than executing individual INSERTs.

Step 2 can be executed in the background by using the Asynchronous Module, if it still proves to be a bit slow. This takes advantage of the bits of downtime between captures.
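A minimal sketch of steps 1-3, using sqlite3_exec and the question's checkdbres helper (the 'capture' table name is just an illustration):

// 1. Capture each batch of 800 values into an unindexed temporary table.
checkdbres( sqlite3_exec( dbhandle, "CREATE TEMP TABLE IF NOT EXISTS capture ( issueid INTEGER, time INTEGER, value REAL );", NULL, NULL, NULL ), SQLITE_OK, "Failed to create capture table");
// ... fast, unindexed inserts into 'capture' go here, inside a transaction ...

// 2. Copy the whole batch into the indexed master table in a single statement.
checkdbres( sqlite3_exec( dbhandle, "INSERT INTO data SELECT issueid, time, value FROM capture;", NULL, NULL, NULL ), SQLITE_OK, "Failed to copy batch into data");

// 3. Empty the temporary table, ready for the next capture.
checkdbres( sqlite3_exec( dbhandle, "DELETE FROM capture;", NULL, NULL, NULL ), SQLITE_OK, "Failed to clear capture table");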

Robert Harvey
Interesting idea that's worth a shot. I'll be away from my dev pc for the weekend, so will have a go on Monday and get back to you. Thanks.
Brian O'Kennedy
+1  A: 

Answering my own question just as a place to put some details:

It turns out (as correctly suggested above) that the index creation is the slow step, and every time I do another transaction of inserts, the index is updated, which takes some time. My solution is to: (A) create the data table, (B) insert all my historical data (several years' worth), and (C) only then create the indexes.

Now all lookups etc. are really fast and sqlite does a great job. Subsequent updates during the day now take a few seconds to insert just 800 records, but that's no problem since they only run every 10 minutes or so.
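In outline, the build now looks like this (a sketch using sqlite3_exec; the index names are just placeholders, not the exact production code):

// (A) create the table with no indexes
checkdbres( sqlite3_exec( dbhandle, "CREATE TABLE IF NOT EXISTS data ( id1 INTEGER, timet INTEGER, value REAL );", NULL, NULL, NULL ), SQLITE_OK, "Failed to create table");

// (B) bulk-insert all the historical data here, batched in transactions,
//     using a single prepared "insert into data values (?,?,?)" statement

// (C) only now build the indexes, once, over the full data set
checkdbres( sqlite3_exec( dbhandle, "CREATE INDEX id1index ON data (id1);", NULL, NULL, NULL ), SQLITE_OK, "Failed to create id1 index");
checkdbres( sqlite3_exec( dbhandle, "CREATE INDEX timetindex ON data (timet);", NULL, NULL, NULL ), SQLITE_OK, "Failed to create timet index");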

Thanks to Robert Harvey and maxwellb for the help/suggestions/answers above.

Brian O'Kennedy