views:

1021

answers:

5

Hi all,

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).

At the beginning (database size = 0 GB), I get between 60 and 70 writes per millisecond. After roughly 5 GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.

Question: Is this typical? Would I still see this scalability issue if I REMOVED indexing? What are the causes of this problem?

Thanks in advance.

Jbu

Notes: Each record consists of a few ints

+5  A: 

Yes; indexing improves fetch times at the cost of insert times. Without knowing more, your numbers sound reasonable.

You can benchmark it. You'll need a reasonable amount of data stored. Consider whether or not to index based on the queries: heavy fetch and light insert? Index everywhere a WHERE clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
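To make that concrete, here's a minimal benchmark sketch in Python, using sqlite3 purely as a stand-in engine (the table and column names are made up for illustration):

```python
# Time bulk inserts with and without an index on the query column.
# sqlite3 is a stand-in for whatever engine you actually use.
import sqlite3
import time

def time_inserts(create_index):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, value INTEGER)")
    if create_index:
        conn.execute("CREATE INDEX idx_ts ON readings (ts)")
    rows = [(i, i % 100) for i in range(100_000)]
    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return time.perf_counter() - start

print("no index:  ", time_inserts(False))
print("with index:", time_inserts(True))
```

Run it with your own schema, data distribution, and volumes; the whole point is that the answer depends on your workload.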

When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).

Richard T
A: 

Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
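A sketch of that drop-then-reapply pattern, with sqlite3 standing in for a real engine (table and index names are illustrative):

```python
# Drop the index before a bulk load so each insert skips index
# maintenance, then rebuild the index once at the end.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (ts INTEGER, value INTEGER)")
conn.execute("CREATE INDEX idx_corpus_ts ON corpus (ts)")

conn.execute("DROP INDEX idx_corpus_ts")          # remove before bulk update
with conn:
    conn.executemany("INSERT INTO corpus VALUES (?, ?)",
                     ((i, i * 2) for i in range(50_000)))
conn.execute("CREATE INDEX idx_corpus_ts ON corpus (ts)")  # reapply when done

print(conn.execute("SELECT COUNT(*) FROM corpus").fetchone()[0])
```

Rebuilding the index in one pass over sorted data is generally much cheaper than maintaining it row by row during the load.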

The type of indices applied also influences insertion performance - for example, with a SQL Server clustered index, the same I/O handles both data placement and index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.

As with any engineering project, the best advice is to measure with real datasets (which reflect real skew, page distribution, tearing, etc.).

stephbu
+1  A: 

It is typical for indexes to sacrifice insert speed for access speed. You can see this taken to an extreme in database tables (and I've seen these in the wild) that index every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.

However, given that:

1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5,000/second);

2/ You're only writing a few integers per record; and

3/ Your queries are based only on time ranges,

you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).

If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).

This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.

This of course depends on my assumption of your storage being correct:

1/ You're writing records sequentially based on time.

2/ You only need to query on time ranges.
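A rough sketch of the flat-file-plus-sparse-index idea in Python; the record layout and the in-memory buffer are assumptions for illustration (a real version would append to a file on disk):

```python
# Append fixed-size (timestamp, value) records sequentially, and keep a
# separate sparse index of byte offsets, one entry per minute.
import bisect
import struct

RECORD = struct.Struct("<qi")   # 8-byte timestamp (ms) + 4-byte int value

class FlatLog:
    def __init__(self):
        self.data = bytearray()   # stand-in for an append-only file
        self.index = []           # sorted (minute, byte_offset) pairs

    def append(self, ts_ms, value):
        minute = ts_ms // 60_000
        if not self.index or self.index[-1][0] != minute:
            self.index.append((minute, len(self.data)))  # new minute marker
        self.data += RECORD.pack(ts_ms, value)

    def scan(self, start_ms, end_ms):
        # Seek to the start of the relevant minute, then scan forward.
        pos = bisect.bisect_right(
            self.index, (start_ms // 60_000, float("inf"))) - 1
        offset = self.index[max(pos, 0)][1]
        out = []
        for off in range(offset, len(self.data), RECORD.size):
            ts, val = RECORD.unpack_from(self.data, off)
            if ts > end_ms:
                break
            if ts >= start_ms:
                out.append((ts, val))
        return out

log = FlatLog()
for i in range(0, 300_000, 100):       # one record every 100 ms for 5 minutes
    log.append(i, i // 100)
print(len(log.scan(60_000, 119_999)))  # records in the second minute
```

Writes are pure appends, and a read only pays for one binary search plus a scan from the nearest minute boundary.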

paxdiablo
You have even more assumptions in your proposal than you list! Not the least of these is that the asker is probably using an RDBMS for a good reason (of which there are many possible). One would require an RDBMS, for example, if one wanted to do a JOIN.
Richard T
In some situations this is actually a great idea. The trick is knowing when to use it. You have no idea if the person asking the question needs to do a JOIN.
Jacob
+1  A: 

Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).

Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
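One way to check that insert-order sensitivity is to time the same batch of inserts with sequential versus shuffled keys; sqlite3 is just a stand-in engine here and the schema is illustrative:

```python
# Compare indexed-insert cost for sequential vs. randomly distributed keys.
import random
import sqlite3
import time

def timed_insert(keys):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
    conn.execute("CREATE INDEX idx_k ON t (k)")
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO t VALUES (?, ?)",
                         ((k, 0) for k in keys))
    return time.perf_counter() - start

n = 100_000
sequential = list(range(n))
shuffled = sequential[:]
random.shuffle(shuffled)

print("sequential keys:", timed_insert(sequential))
print("shuffled keys:  ", timed_insert(shuffled))
```

Whichever order matches how your data will actually arrive is the one your benchmark should use.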

SquareCog
Excellent point. Generally, I prefer to load a table, _then_ build the index(es), for this very reason. I will even unload and reload a table sometimes to help rebalance after a significant time in service where new data is being inserted (and in some cases, updated).
Richard T
You don't need to go quite that far, Richard, it's enough to drop and recreate the indexes. Though you might get some defragmentation from reloading, which can be a good thing.. but generally, I'd say just rebuild the indexes.
SquareCog
Yes, Dmitriy, yet some RDBMSes have "inherent" table structures and don't just index a heap. That is, the base table itself is a structure, just like an index - a built-in index - and such cases can benefit from reloading now and then.
Richard T
A: 

I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
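Back-of-envelope arithmetic for why page size matters with fixed-size records: a bigger page holds more entries, so pages fill and split less often. The record and header sizes below are assumptions for illustration, not BDB's actual internal numbers:

```python
# Rough estimate of entries per B-tree page at different page sizes.
RECORD_BYTES = 16    # a few ints plus per-entry overhead (assumed)
PAGE_OVERHEAD = 26   # rough per-page header cost (assumed)

def entries_per_page(page_size):
    return (page_size - PAGE_OVERHEAD) // RECORD_BYTES

for size in (4096, 16384, 65536):
    print(f"{size:6d}-byte page: ~{entries_per_page(size)} entries")
```

Fewer, larger pages means fewer page splits per insert, at the cost of moving more bytes per page I/O; measure to find the sweet spot for your record size.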

Hippiehunter