views:

928

answers:

7

Hi,

We're developing a really big project and I was wondering if anyone can give me some advice about which DB backend we should pick.

Our system is composed of 1100 electronic devices that send a signal to a central server, and the server then stores the signal info (each signal is about 35 bytes long). However, these devices will be sending about 3 signals per minute each, so if we do the numbers, that's 4,752,000 new records/day in the database, and a total of 142,560,000 new records/month.
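
For a quick sanity check, here's that arithmetic as a small Python sketch (the device count, signal rate, and signal size are the figures quoted above; the 30-day month is an assumption):

    devices = 1100
    signals_per_minute_per_device = 3
    signal_size_bytes = 35

    records_per_day = devices * signals_per_minute_per_device * 60 * 24
    records_per_month = records_per_day * 30  # assuming a 30-day month

    print(records_per_day)    # 4752000 records/day
    print(records_per_month)  # 142560000 records/month

    # raw payload volume, ignoring indexes and per-record overhead
    print(records_per_day * signal_size_bytes / 2**20)    # ~159 MB/day
    print(records_per_month * signal_size_bytes / 2**30)  # ~4.6 GB/month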

We need a DB backend that is lightning fast and reliable. Of course, we also need to do some complex data mining on that DB. We're doing some research on MongoDB/Cassandra/Redis/CouchDB, but their documentation websites are still in early stages.

Any help? Ideas?

Thanks a lot!

+1  A: 

I've used MongoDB from Incanter and have liked it. Although I can't speak to the speed with such large datasets, Clojure (which Incanter is based on) is very reliable in terms of transaction management. Incanter also provides some great analysis tools, so if you're planning on analyzing all of that data, MongoDB + Incanter could be a powerful combination.

allie
Clojure has native support for *software transactional memory*, not *database* transactions (let alone distributed database transactions).
+1  A: 

So you are storing data in a central db for datamining? No online transaction processing?

I don't think that MongoDB does a good job when it comes to durability. See http://nosql.mypopescu.com/post/392868405/mongodb-durability-a-tradeoff-to-be-aware-of.

Maybe you could use an analytics DB such as Infobright; it has a community edition: http://www.infobright.org/

TTT
Thanks for the reply. I don't need online transaction processing, only storage for data mining. I'll check out Infobright and let you know.
Juanda
+1  A: 

If you like the look of Cassandra for its designed-from-the-start ability to scale horizontally, tune consistency against availability, and so on, then you may also want to look at Riak, which has a similar feature set but a different approach.

Evan
I wasn't aware of Riak. I'll give that a try and let you know. Thanks for your reply!
Juanda
+3  A: 

1100 devices * 3 signals/minute = 3300 signals/minute, i.e. ~55 writes/s, which any of these systems will be able to handle easily.

Cassandra will probably work best as your data set grows larger than memory, though, and the Hadoop integration will help with your data mining.

jbellis
Thanks for your reply. I'll look into Hadoop more deeply because, truth be told, I'm not familiar with it. Thanks a lot!
Juanda
+1  A: 

You are looking for a datastore that allows "lightning fast" writes (data persisted on disk), with the data mining to occur at a later stage (the READ cycle). Also, considering the numbers you state, it turns out you will collect all of about 159 MB of information per day, or approx. 5 GB per month.

In this case, why not look at Redis?

You could always archive the daily Redis data file and refer to it later (if you have concerns about loading 5 GB or more into RAM, this archiving could be a workaround).

Redis is rather fast, based on the numbers published on its site. Hope this helps. Kiran
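
As an illustration of that write path, here is a minimal sketch using the redis-py client; the key naming (one list per day, which you could dump and archive) is just an assumption, not something Redis prescribes:

    import datetime
    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)

    def store_signal(device_id, payload):
        """Append one ~35-byte signal to today's list; archive the list (or the RDB dump) later."""
        key = "signals:%s" % datetime.date.today().isoformat()  # e.g. signals:2010-05-20
        r.rpush(key, "%s|%s" % (device_id, payload))

    # example usage
    store_signal(42, "aGVsbG8gd29ybGQ=")

    # later, for data mining, pull a whole day back out:
    # rows = r.lrange("signals:2010-05-20", 0, -1)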

Kiran Subbaraman
+2  A: 

CouchDB is very reliable, provides excellent durability, and you'll experience very low CPU load. It's also excellent at replicating between multiple nodes, either on-demand or continuously.

Thanks to its replication abilities and RESTful API (it uses HTTP for its API) you can scale horizontally pretty easily using mature tools. (Nginx or Apache for reverse proxying, HTTP load balancers, etc.)

You write map/reduce functions in JavaScript to precompute queries. The results are built up incrementally on disk, which means they only need to be computed once per signal. In other words, queries can be really fast because CouchDB only has to do calculations on the signal data recorded since the last time you ran the query.
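
For example, a view that counts signals per device might look like the sketch below, pushed over CouchDB's HTTP API with the Python requests library (the database and design-document names are invented for illustration):

    import requests  # pip install requests

    COUCH = "http://localhost:5984"
    DB = "signals"  # hypothetical database name

    # a design document with one map/reduce view, written in JavaScript as CouchDB expects
    design_doc = {
        "_id": "_design/stats",
        "views": {
            "per_device": {
                "map": "function(doc) { emit(doc.device_id, 1); }",
                "reduce": "_count",
            }
        },
    }

    requests.put("%s/%s/_design/stats" % (COUCH, DB), json=design_doc)

    # query it; results are maintained incrementally, so repeat queries only touch new signals
    resp = requests.get("%s/%s/_design/stats/_view/per_device" % (COUCH, DB),
                        params={"group": "true"})
    print(resp.json())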

CouchDB trades disk space for performance, so you can expect to use a lot of disk space; that said, your queries can be lightning fast, and you can keep disk usage in check if you implement them properly.

Give CouchDB a try.

Check out "Why Large Hadron Collider Scientists are Using CouchDB" and "CouchDB at the BBC as a fault tolerant, scalable, multi-data center key-value store".

duluthian
+4  A: 

Don't let the spatial scale (1000+ devices) mislead you as to the computational and/or storage scale. A few dozen 35-byte inserts per second is a trivial workload for any mainstream DBMS, even running on low-end hardware. Likewise, 142 million records per month is only on the order of 1~10 gigabytes of storage per month, without any compression and even including indices.

In your question comment, you said:

"It's all about reliability, scalability and speed. It's very important that the solution scales easily (MongoDB autosharding?) just throwing in more nodes, and the speed is also very important

Reliability? Any mainstream DBMS can guarantee this (assuming you mean it's not going to corrupt your data, and it's not going to crash--see my discussion of the CAP theorem at the bottom of this answer). Speed? Even with a single machine, 10~100 times this workload should not be a problem. Scalability? At the current rate, a full year's data, uncompressed, even fully indexed, would easily fit within 100 gigabytes of disk space (likewise, we've already established the insert rate is not an issue).

As such, I don't see any clear need for an exotic solution like NoSQL, or even a distributed database--a plain, old relational database such as MySQL would be just fine. If you're worried about failover, just set up a backup server in a master-slave configuration. If we're talking 100s or 1000s of times the current scale, just horizontally partition a few instances based on the ID of the data-gathering device (i.e. {partition index} = {device id} modulo {number of partitions}).
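
A minimal sketch of that partitioning scheme, assuming you keep one connection string per MySQL instance (the instance names are invented for illustration):

    # hypothetical list of MySQL instances (DSNs), one per horizontal partition
    PARTITIONS = [
        "mysql://signals1.example.com/signals",
        "mysql://signals2.example.com/signals",
        "mysql://signals3.example.com/signals",
    ]

    def partition_for(device_id):
        """{partition index} = {device id} modulo {number of partitions}"""
        return PARTITIONS[device_id % len(PARTITIONS)]

    # a given device always lands on the same instance, so its history stays together
    print(partition_for(1042))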

Bear in mind that leaving the safe and comfy confines of the relational database world means abandoning both its representational model and its rich toolset. This will make your "complex datamining" much more difficult--you don't just need to put data into the database, you also need to get it out.

All of that being said, MongoDB and CouchDB are uncommonly simple to deploy and work with. They're also very fun, and will make you more attractive to any number of people (not just programmers--executives, too!).

The common wisdom is that, of the three NoSQL solutions you suggested, Cassandra is the best for high insert volume (of course, relatively speaking, I don't think you have high insert volume--this was designed to be used by Facebook); this is countered by being more difficult to work with. So unless you have some strange requirements you didn't mention, I would recommend against it, for your use case.

If you're positively set on a NoSQL deployment, you might want to consider the CAP theorem. This will help you decide between MongoDB and CouchDB. Here's a good link: http://blog.nahurst.com/visual-guide-to-nosql-systems. It all comes down to what you mean by "reliability": MongoDB trades availability for consistency, whereas CouchDB trades consistency for availability. (Cassandra allows you to finesse this tradeoff, per query, by specifying how many servers must be written/read for a write/read to succeed; UPDATE: Now, so can CouchDB, with BigCouch! Very exciting...)
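
To make the Cassandra point concrete, here is a minimal sketch of that per-query tradeoff using the DataStax Python driver (a modern client not mentioned in this thread; the keyspace and table names are invented):

    import datetime

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # "telemetry" keyspace and "signals" table are hypothetical
    session = Cluster(["127.0.0.1"]).connect("telemetry")

    # write: require a quorum of replicas to acknowledge before the insert counts as successful
    insert = SimpleStatement(
        "INSERT INTO signals (device_id, received_at, payload) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (42, datetime.datetime.utcnow(), b"\x00" * 35))

    # read: favour availability, a single replica is enough to answer
    read = SimpleStatement(
        "SELECT * FROM signals WHERE device_id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    rows = session.execute(read, (42,))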

Best of luck in your project.