views:

457

answers:

8

Hi,

Is there a key value store that will give me the following:

  • Allow me to simply add and remove nodes and will redstribute the data automatically
  • Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
  • Allow me to store text or images up to 1GB in size
  • Can store small size data up to 100TB of data
  • Fast (so will allow queries to be performed on top of it)
  • Make all this transparent to the client
  • Works on Ubuntu/FreeBSD or Mac
  • Free or open source

I basically want something I can use a "single", and not have to worry about having memcached, a db, and several storage components so yes, I do want a database "silver bullet" you could say.

Thanks

Zubair

Answers so far: MogileFS on top of BackBlaze - As far as I can see this is just a filesystem, and after some research it only seems to be appropriate for large image files

Tokyo Tyrant - Needs lightcloud. This doesn't auto scale as you add new nodes. I did look into this and it seems it is very fast for queries which fit onto a single node though

Riak - This is one I am looking into myself, but I don't have any results yet

Amazon S3 - Is anyone using this as their sole persistance layer in production? From what I have seen it seems to be used for storage of images as complex queries are too expensive

@shaman suggested Cassandra - definitely one I am looking into

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

+1  A: 

I can suggest you two possible solutions:

1) Buy Amazon's service (Amazon S3). For 100 TB it will cost you 14 512$ monthly.
2) much cheaper solution:

Build two custom backblaze storage pods (link) and run a MogileFS on top of it.

Currently I'm investigating how to store petabytes of data using similar solutions, so if you find something interesting on that, please post you notes.

Shaman
Thanks, I'm checking this out. I'm also looking at Riak which is based on Erlang. What is the advantage of using a MogileFS based system over a distributed key value store?
Zubair
+1  A: 

Take a look at Tokyo Tyrant. It is a very lightweight, high-performance, replicating daemon exporting a Tokyo Cabinet key-value store to the network. I've heard good things about it.

David Schmitt
I have used Tokyo Cabinet AND the networklayer Tokyo Tyrant, but they are definitely not scalable as they need a layer on top called Lightcloud to perform the partitioning, and this is fixed partioning.
Zubair
Lightcloud is not *NEEDED* per se. It's just one way of doing so. There's nothing from stopping you from writing your own consistent hashing implementation atop Tokyo Tyrant.
z8000
@z8000. Yes, you are right, since Tokyo Cabinet does not provide dynamic partitioning it would need to be custom written. Thanks
Zubair
+1  A: 

You might want to take a look at MongoDB.

From what I can tell you're looking for a database/distrubuted filesystem mix, which might be hard or even impossible to find.

You might want to take a look at distributed filesystems like MooseFS or Gluster and keep your data as files. Both systems are fault-tolerant and distributed (you can put in and take out nodes as you like), and both are transparent to clients (built on top of FUSE) - you're using simple filesystem ops. This covers following features: 1), 2), 3), 4), 6), 7), 8). We're using MooseFS for digital movies storage with something aroung 1,5 PB of storage and upload/download is as fast as network setup allows (so the performance is I/O dependent, not protocol or implementation dependent). You won't have queries (feature 5) on your list), but you can couple such filesystem with something like MongoDB or even some search engine like Lucene (it has clustered indexes) to query data stored in filesystem.

Endru6
Yes, I'm looking more at MongoDb
Zubair
+3  A: 

Amazon S3 is a storage solution, not a database.

If you only need simple key/value your best bet would be to use Amazon SimpleDB in combination with S3. Large files are stored on S3, while meta data for searching is stored in SimpleDB. this gives you a horizontally scalable key/value system with direct access to S3.

Great Turtle
+2  A: 

From what I see in your question Project Voldemort seems to be the closest one. Have a look at their Design page.

The only problem I see is how will it handle huge files, and according to this thread, thing aren't all good. But you can always work around that fairly easily using files. In the end - this is the exact purpose of a file system. Have a look at the wikipedia list of file systems - the list is huge.

Emil Ivanov
Noone except its creators is using voldemort in production according to what I can find on the web
Zubair
+2  A: 

There's another solution, which seems to be exactly what you are looking for: The Apache Cassandra project: http://incubator.apache.org/cassandra/

At the moment twitter is switching to Cassandra from memcached+mysql cluster

Shaman
Yes, Cassandra is looking more attractive all the time.
Zubair
+2  A: 

HBase and HDFS together fulfil most of these requirements. HBase can be used to store and retrieve small objects. HDFS can be used to store large objects. HBase compacts small objects and stores them as larger ones on HDFS. Speed is relative - HBase is not as fast on random reads from disk as mysql (for example) - but is pretty fast serving reads from memory (similar to Cassandra). It has excellent write performance. HDFS, the underlying storage layer, is fully resilient to loss of multiple nodes. It replicates across racks as well allowing rack level maintenance. It's a Java based stack with Apache license - runs pretty much most OS.

The main weaknesses of this stack are less than optimal random disk read performance and lack of cross data center support (which is a work in progress).

Joydeep Sen Sarma
+2  A: 

You are asking too much from open source software.

If you have a couple hundred thousand dollars in your budget for some enterprise class software, there are a couple of solutions. Nothing is going to do what you want out of the box, but there are companies that have products which are close to what you are looking for.

"Fast (so will allow queries to be performed on top of it)"

If you have a key-value store, everything should be very fast. However the problem becomes that without an ontology or data schema built on top of the key-value store, you will end up going through the whole database for each query. You need an index containing the key for each "type" of data you want to store.

In this case, you can usually perform queries in parallel against all ~15,000 machines. The bottleneck is that cheap hard drives cap out at 50 seeks per second. If your data set fits in RAM, your performance will be extremely high. However, if the keys are stored in RAM but there is not enough RAM for the values to be stored, the system will goto disc on almost all key-value lookups. The keys are each located at random positions on the drive.

This limits you to 50 key-value lookups per second per server. Whereas when the key-value pairs are stored in RAM, it is not unusual to get 100k operations per second per server on commodity hardware (ex. Redis).

Serial disc read performance is however extremely high. I have seek drives goto 50 MB/s (800 Mb/s) on serial reads. So if you are storing values on disc, you have to structure the storage so that the values that need to be read from disc can be read serially.

That is the problem. You cannot get good performance on a vanilla key-value store unless you either store the key-value pairs completely in RAM (or keys in RAM with values on SSD drives) or if you define some type of schema or type system on top of the keys and then cluster the data on disc so that all keys of a given type can be retrieved easily through a serial disc read.

If a key has multiple types (for example if you have data-type inheritance relationships in the database), then the key will be an element of multiple index tables. In this case, you will have to make time-space trade offs to structure the values so that they can be read serially from disc. This entails storing redundant copies of the value for the key.

What you want is going to be a bit more advanced than a key-value store, especially if you intend to do queries. The problem of storing large files however is a non-problem. Pretend your system can keys upto 50 meg. Then you just break up a 1 gig file into 50 meg segments and associate a key to each segment value. Using a simple server it is straight forward to translate the part of the file you want into a key-value lookup operation.

The problem of achieving redundancy is more difficult. Its very easy to "fountain code" or "part file" the key-value table for a server, so that the server's data can be reconstructed at wire speed (1 Gb/s) onto a standby server, if a particular server dies. Normally, you can detect server death using a "heart beat" system which is triggered if the server does not respond for 10 seconds. It is even possible to key-value lookups against the part-file encoded key-value tables, but it is inefficient to do so but still gives you a backup for the event of server failure. A bigger issues it is almost impossible to keep the backup up to date and the data may be 3 minutes old. If you are doing lots of writes, the backup functionality is going to introduce some performance overhead, but the overhead will be negligible if your system is primarily doing reads.

I am not an expert on maintaining database consistency and integrity constraints under failure modes, so I am not sure what problems this requirement would introduce. If you do not have to worry about this, it greatly simplifies the design of the system and its requirements.

Fast (so will allow queries to be performed on top of it)

First, forget about joins or any operation that scales faster than n*log(n) when your database is this large. There are two things you can do to replace the functionality normally implemented with joins. You can either structure the data so that you do not need to do joins or you can "pre-compile" the queries you are doing and make a time-space trade-off and pre-compute the joins and store them for lookup in advance.

For semantic web databases, I think we will be seeing people pre-compiling queries and making time-space trade-offs in order to achieve decent performance on even modestly sized datasets. I think that this can be done automatically and transparently by the database back-end, without any effort on the part of the application programmer. However we are only starting to see enterprise databases implementing these techniques for relational databases. No open source product does it as far as I am aware and I would surprised if anyone is trying to do this for linked data in horizontally scalable databases yet.

For these types of systems, if you have extra RAM or storage space the best use of it is to pre-compute and store the result of common sub-queries for performance reasons, instead of adding more redundancy to the key-value store. Pre-compute results and order by the keys you are going to query against to turn an n^2 join into a log(n) lookup. Any query or sub-query that scales worse than n*log(n) is something whose results need to be performed and cached in the key-value store.

If you are doing a large number of writes, the cached sub-queries will be invalidated faster than they can be processed and there is no performance benefit. Dealing with cache invalidation for cached sub-queries is another intractable problem. I think a solution is possible, but I have not seen it.

Welcome to hell. You should not expect to get a system like this for free for another 20 years.

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

You are asking for a miracle. Wait 20 years until we have open source miracle databases or you should be willing to pay money for a solution customized to your application's needs.

If I wrote one of these databases, it would be the kind of database that buys me my Gulfstream V and puts me on the Forbes Billionaire list after my company IPOs.

Brandon Smietana