I don't really care if it's NoSQL or SQL based - as long as it uses int indexes (and stores them in RAM for fast searching) so I can find my data with simple queries based on criteria like user_id, lat, status, or other common int fields. The actual records can be stored on disk.

My first guess would be SQLite - but it moves slowly when dealing with a lot of concurrent writes.

Second, it also needs to be able to run in very small amounts of RAM for a VPS with limited resources. This excludes MongoDB, since it spreads to fill all available RAM (well, its disk cache does, really). I also can't use MySQL InnoDB since it uses about 100MB of RAM just to load, and MyISAM doesn't support ACID.

So are there any RDBMS or NoSQL databases that meet all four requirements?

Update: When I say small databases, I mean databases that only use 8-60MB of RAM. I understand that actual data will increase this, but most of my datasets are usually under 1GB for the smaller sites, with about 5MB of indexes that would need to be stored in RAM. So an ideal database would use about 30MB when running with a fully indexed dataset of about 1GB. Take this site for example: I doubt the whole Stack Overflow site takes much more than 1GB to store.

Update: To clarify, a setup would ideally store all data on disk. However, it would also keep column indexes in RAM (just ints, after all) which would contain the needed pointers to data on the disk. This would avoid two things: 1) keeping unneeded rows in memory (like Redis) and 2) keeping indexes on the hard drive, which slows searches (SQLite).

An example is MySQL, which can be configured to only keep primary and secondary indexed columns in memory and all other data on the hard drive. However, MySQL either uses 100MB of extra RAM just to add InnoDB, or you forgo ACID compliance and stick with MyISAM, which is not transaction-safe.

Again, the target is systems that are limited in RAM and can't handle more than a couple of megabytes of cached indexes - but that still need to allow frequent writes/updates of normally small datasets in a safe manner.

Update: apparently finding something that meets all these requirements is a bit much. So, starting with the most important features let me list them in descending importance.

  1. Low memory usage
  2. Indexes (or something to mimic them)
  3. Handles concurrent writes
  4. ACID

Expanding on #1, it is more important that data can be written than that reads are fast. This also means that the amount of RAM should not have any effect on the amount of data that can be stored.

Expanding on #2, ideally (given how small they are) indexes should be stored in RAM since indexes should be nothing more than int values that are compared to filter results before actually accessing the disk for the data.
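To make that concrete, here is a rough sketch in plain Python (made-up field names, no particular library) of the access pattern I mean: the int index lives in RAM and maps values to byte offsets, and the disk is only touched for rows that match.

    import json

    # In-RAM index: int column value -> list of byte offsets into the data file.
    # Only the offsets live in memory; the rows stay on disk until a query matches.
    status_index = {}

    def append_row(f, row):
        offset = f.tell()
        f.write((json.dumps(row) + "\n").encode())
        status_index.setdefault(row["status"], []).append(offset)

    def rows_with_status(f, status):
        for offset in status_index.get(status, []):
            f.seek(offset)
            yield json.loads(f.readline())

    with open("rows.dat", "w+b") as f:
        append_row(f, {"user_id": 1, "status": 1, "lat": 40})
        append_row(f, {"user_id": 2, "status": 0, "lat": 41})
        print(list(rows_with_status(f, 1)))   # seeks only to matching offsets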

A: 

Any particular reason to not use raw btrees?

Joshua
Personal ignorance, though you could fix that for me.
Xeoncross
http://en.wikipedia.org/wiki/B-tree
Joshua
+3  A: 

For high-volume concurrent inserts and updates, a hashed table beats b-trees hands-down. The tradeoff is that with a hashed table you cannot store the data physically in a meaningful sorted order, e.g. sorted by zip, to make selecting and ordering by zipcode faster; rather the hashFunction( primaryKey ) determines the location where the record is placed in the file, and if you want to extract all records with a particular zipcode, those rows won't be physically contiguous. But you'll have to define what you mean by "a lot of concurrent writes".

EDIT: just wanted to underscore the fact that with a hashed table, you don't need to keep the Primary Keys in RAM because the hash-key function is fed the key value and it returns the record location to your seek routine.

So keys-in-RAM per se need not be one of your requirements. Speedy retrieval is the requirement.
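A crude sketch of the idea in Python (hypothetical names, not any particular library; collision handling omitted):

    RECORD_SIZE = 64     # fixed-width slots so a slot number maps straight to a byte offset
    BUCKETS = 100000     # size the table so that few keys share a slot

    def slot_offset(primary_key):
        # The hash of the key *is* the record location; no key needs to live in RAM.
        return (primary_key % BUCKETS) * RECORD_SIZE

    def write_record(f, primary_key, payload):
        f.seek(slot_offset(primary_key))
        f.write(payload.ljust(RECORD_SIZE, b"\x00"))

    def read_record(f, primary_key):
        f.seek(slot_offset(primary_key))
        return f.read(RECORD_SIZE).rstrip(b"\x00")

    with open("records.dat", "w+b") as f:
        write_record(f, 1234, b"alice|19004")
        print(read_record(f, 1234))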

Tim
Well, I don't have a hard number - but perhaps 5 or so every 100ms.
Xeoncross
A properly sized hash-table so that there are few records per page would be able to handle this kind of load. You don't need an index for the PK. If you have 50 inserts/updates per second, you'd want to keep the number of indexed columns to a minimum and/or use composite indexes, the goal being to avoid collisions when writing the indices.
Tim
Now if only I could build one...
Xeoncross
@Xeoncross -- there are open source hash-table libraries available. It might not be as difficult as you envision. http://en.wikipedia.org/wiki/Hash_table
Tim
Keys-in-RAM mean any of the INDEX keys that are needed to quickly find the correct values in the massive store of data. For example, finding things like "active" user accounts, "recent" comments, and lng/lat coordinates are all INT column values that should be stored in RAM to prevent searching through the entire database to find the right rows to fetch. MongoDB and MySQL are good examples of databases that can do this. SQLite and CouchDB are examples of ones that can't.
Xeoncross
My point was that you don't need to have keys in RAM to avoid having to "search through the entire database to find the right rows to fetch". There's more than one way to solve your problem. Requiring keys in RAM rather than requiring speedy retrieval however it is implemented is to insist on one implementation when another might work just as well or better. With a hashed table, a single read can fetch all the customer primary keys in zipcode 19004, say, and once you have those PKs in RAM retrieval of main customer records is instant.
Tim
True, storing the needed keys in a single record would accomplish the same thing. However, the problem is that now, instead of using indexes stored in RAM to find the right records, you have to create a single record for each possible lookup sequence: a "recent comments" record since you can't sort by a timestamp int, an "active comments" record since you can't filter by a "status" boolean, a "comments for blog post 57" record since you can't filter by "post_id", etc. This means you have to have much more application logic to mimic the value that indexed keys already add to an application.
Xeoncross
You'd have a hashed-table for each column in your main table that you wanted to index. If main table is CUSTOMERS and index is on ZIP you'd have a table CUSTOMER_ZIP whose PKs are zipcodes and each zipcode's record contains the associated CUSTOMER-ids. If your DBMS handles indexes transparently for you when you issue a query, your application requires less procedural code. If your DBMS doesn't, you have to write the code. RAM-storage simply eliminates disk i/o; disk access is orders of magnitudes slower than RAM-access. RAM-storage of indices does not ease the coding burden.
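In rough Python, the CUSTOMER_ZIP idea is nothing more than this (names hypothetical, a dict standing in for the hashed table):

    # CUSTOMER_ZIP: a second hashed table keyed by zipcode, whose values are the
    # primary keys of the matching CUSTOMER records.
    customer_zip = {19004: [101, 205, 317]}

    def customers_in_zip(zipcode, fetch_customer):
        # one lookup yields every PK in the zip; main records are then fetched by PK
        return [fetch_customer(pk) for pk in customer_zip.get(zipcode, [])]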
Tim
+5  A: 

I'd say no: you have somewhat contradictory requirements

  • In memory = data cache (which means data and indexes). Covering indexes will need more RAM. But you want a small RAM footprint.

  • Write volumes depend on the underlying IO stack, really, e.g. write-ahead logging, which is required for ACID in some implementations.

What would be most important, and what is the priority of the rest?

gbn
Sorry, I mean the smallest *overhead RAM usage*.
Xeoncross
Agreed, these requirements seem completely contradictory. How are you going to have "high write volumes", but only use RAM that's 3% of the data size? Oh yeah and you want it to be ACID, so the data has to be safely committed to disk, but it can't have any RAM space? This whole thing seems like some kind of mythical database structure. Without a *much more specific* set of requirements, you're not going to find an answer here.
Gates VP
To my knowledge these requirements are not contradictory. If a database needs to use RAM while writing data, that is fine; it needs RAM to run anyway. Also, correct me if I'm wrong, but I believe there are methods by which a single database file could be set up in append mode to allow multiple clients to pass all their writes to a single process which keeps the file open and constantly appends data. Data could even be split up among many files, like MongoDB does.
Xeoncross
+2  A: 

Redis may be an option, as it stores indexes (keys, actually) in RAM, is very compact in its RAM usage, and with VM enabled values will go to disk. But it is a key-value store, so modeling might be a bit different. It is also very fast on writes and reads.

CouchDB is also an option with low memory usage, and it's a great database, but it might not fill all your requirements (ACID and keys in RAM).

diogok
+1  A: 

I'd try to squeeze the performance you need out of SQLite.

  • Write performance is good if you do many writes per transaction.
  • 3.7 introduces write-ahead logging (WAL) which, when enabled, can improve performance in many cases (rough example below).
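For example, with Python's sqlite3 module (table and column names are just placeholders):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")   # SQLite 3.7+: readers no longer block the writer
    conn.execute("CREATE TABLE IF NOT EXISTS comments (post_id INTEGER, status INTEGER, body TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_post ON comments (post_id, status)")

    # Batch many writes into a single transaction instead of paying an fsync per row.
    with conn:
        conn.executemany(
            "INSERT INTO comments (post_id, status, body) VALUES (?, ?, ?)",
            [(57, 1, "first"), (57, 1, "second"), (58, 0, "third")],
        )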
Dave R
We need to know whether the OP's projected 50 writes per second will be from 50 separate clients or from only a couple of clients with batches that can be wrapped in a transaction.
Tim
separate clients which would not allow batch transactions at the application level.
Xeoncross
I'm not sure what you mean by "separate".
Tim
+3  A: 

You can take a look at Firebird

Hugues Van Landeghem
+1  A: 

OrientDB: it's the fastest at inserts. On common hardware it inserts 150,000 documents per second. It supports indexes and ACID transactions, and even though it's a NoSQL DBMS it supports SQL, with extensions to handle trees and graphs of documents. Open source and always free (Apache 2 license).

Lvca
It only seems to support Java at this time - though they are working on documenting the protocols for other languages. Also, having something critical like a database built on Java scares me; I'd much rather stick with solid languages like C++. However, this database looks very promising in everything it's trying to accomplish.
Xeoncross
A: 

You didn't say which OS you're targeting.

If it's Windows, you might consider ESENT - an ISAM database that comes as part of the OS (not a lot of people know about it).

Here's a blog post with an intro in C++.

Vijay Patel
Linux, I can't imagine running a production server on Windows. It's bad enough trying to use it day-to-day. However, if the database works on both then that's even better!
Xeoncross
Kind of funny considering this site is run by it...
Xeoncross
+1  A: 

I recommend BerkeleyDB. It doesn't use SQL at all, and you have to specify your lookup method explicitly (sequential records, B-tree, hash). However, using secondary databases, you do have support for indices, and you can set the shared memory consumption (cache size) per database - the default cache size is 256KB. It supports ACID transactions.
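A rough sketch, assuming the bsddb3 Python bindings (the C API is equivalent; exact flags and call signatures vary by version):

    import os
    from bsddb3 import db   # Python bindings for Berkeley DB

    os.makedirs("/tmp/bdb-env", exist_ok=True)
    env = db.DBEnv()
    env.set_cachesize(0, 256 * 1024)             # cap the shared cache at 256KB
    env.open("/tmp/bdb-env",
             db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_TXN |
             db.DB_INIT_LOCK | db.DB_INIT_LOG)   # subsystems needed for ACID transactions

    users = db.DB(env)
    users.open("users.db", None, db.DB_HASH, db.DB_CREATE | db.DB_AUTO_COMMIT)

    # A secondary database acts as an index; here it indexes the first byte ("status").
    by_status = db.DB(env)
    by_status.set_flags(db.DB_DUP)               # many users can share one status value
    by_status.open("by_status.db", None, db.DB_BTREE, db.DB_CREATE | db.DB_AUTO_COMMIT)
    users.associate(by_status, lambda key, data: data[:1])

    txn = env.txn_begin()
    users.put(b"42", b"1|alice", txn=txn)        # status byte, then the payload
    txn.commit()
    print(by_status.get(b"1"))                   # -> b"1|alice"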

Martin v. Löwis