I'm developing a database that holds large scientific datasets. In a typical day, on the order of 5GB of new data is written to the database and another 5GB is deleted. The total database size will be around 50GB. The server I'm running on will not be able to hold the entire dataset in memory.

I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.

Queries are typically for around 100 consecutive values, e.g. SELECT Value FROM data WHERE ID BETWEEN 7000000 AND 7000100; (with "data" standing in for the table name).
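For reference, the schema and access pattern can be sketched as follows. This is a hypothetical illustration using SQLite purely as a stand-in for MySQL/MyISAM; the table name `data` is assumed, not from the original setup:

```python
import sqlite3

# In-memory SQLite stands in for the real MySQL/MyISAM table.
# The table name "data" is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER PRIMARY KEY, Value BLOB)")
conn.executemany(
    "INSERT INTO data (ID, Value) VALUES (?, ?)",
    ((i, b"payload") for i in range(7000000, 7001000)),
)

# Typical query: ~100 consecutive values by primary key.
rows = conn.execute(
    "SELECT Value FROM data WHERE ID BETWEEN 7000000 AND 7000100"
).fetchall()
print(len(rows))  # 101 (BETWEEN is inclusive on both ends)
```

Note that with an index (here the primary key) this is a single range scan, which is why the key layout on disk matters so much once the data no longer fits in memory.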

I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.

Before I put in a lot of work installing new software and rewriting the whole database, I wanted to get a rough idea of whether I am likely to see a significant performance boost by using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these kinds of retrievals.

Thanks

+2  A: 

I would expect Cassandra to do better than a b-tree based system like Tokyo Cabinet, MySQL, or MongoDB when the dataset does not fit in memory. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.

jbellis
+2  A: 

I use MongoDB in production for a write-intensive workload, at well above the rates you describe for both writes and reads. The database is around 90GB, and a single instance (an Amazon m1.xlarge) handles 100 QPS. A typical key->value query takes about 1-15ms on a database with 150M entries, with query times reaching 30-50ms under heavy load. At any rate, 200ms is far too slow for a key/value store.

If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn. If you are looking for a distributed solution, you can try any Dynamo clone; Cassandra (Facebook) and Project Voldemort (LinkedIn) are the most popular. Keep in mind that demanding strong consistency slows these systems down quite a bit.

Asaf
Thanks - am running some benchmarks now with MongoDB, Tokyo Tyrant and Cassandra, and I am definitely seeing vast improvements in query times. However, FYI, bulk inserts are proving not quite so fast (compared to MySQL's LOAD DATA INFILE).
Pete W
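On the bulk-insert point above: a large part of the gap is per-statement overhead, and issuing inserts in batches usually narrows it considerably. A minimal, runnable sketch of the effect, using SQLite from the standard library purely as a stand-in (absolute numbers will differ from MySQL or MongoDB, and the table name `data` is assumed):

```python
import sqlite3
import time

# Hypothetical illustration: row-by-row inserts vs. one batched call.
N = 20000
rows = [(i, b"payload") for i in range(N)]

def load(insert):
    """Create a fresh table, run the given insert strategy, and time it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (ID INTEGER PRIMARY KEY, Value BLOB)")
    t0 = time.perf_counter()
    insert(conn)
    conn.commit()
    elapsed = time.perf_counter() - t0
    count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
    return elapsed, count

# One INSERT statement per row.
t_slow, n_slow = load(
    lambda c: [c.execute("INSERT INTO data VALUES (?, ?)", r) for r in rows]
)
# One batched call; drivers typically prepare the statement once.
t_fast, n_fast = load(
    lambda c: c.executemany("INSERT INTO data VALUES (?, ?)", rows)
)
print(f"row-by-row: {t_slow:.3f}s  batched: {t_fast:.3f}s")
```

The same principle applies to the NoSQL stores being benchmarked: their batch/bulk-load paths are usually much closer to LOAD DATA INFILE performance than their one-document-at-a-time APIs.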
+1  A: 

Please also consider OrientDB. It uses indexes based on an RB+Tree algorithm. In my tests against a 100GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, though this depends on how the keys/values are distributed inside the index.

Making your own test with it should take less than an hour.

One piece of bad news is that OrientDB does not support a clustered configuration yet (it's planned for September 2010).

Lvca