Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or using the Amazon SimpleDB web service instead?

With several EC2 instances running MongoDB, I have the problem of setting up the instances myself.

With SimpleDB, I have the problem of locking myself into Amazon's data structure, right?

What differences are there development-wise? Shouldn't I be able to just switch the DAO behind my service layer to write to either MongoDB or AWS SimpleDB?

+4  A: 

Hi,

SimpleDB has some scalability limitations. You can only scale it by sharding, and that sharding is manual (you have to do it yourself). It has higher latency than MongoDB and Cassandra, it has a throughput limit, and its price is higher than the other options.
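To make "manual sharding" concrete, here is a minimal sketch of the routing logic the application would have to own: a stable hash picks one of several domains for each item key. The domain names and the `domain_for` helper are hypothetical and only illustrate the idea; no SimpleDB API is called here.

```python
import hashlib

# Hypothetical domain names; the application has to create and track these itself.
DOMAINS = ["items_0", "items_1", "items_2", "items_3"]


def domain_for(item_key: str, domains=DOMAINS) -> str:
    """Map an item key to one domain using a stable hash of the key."""
    digest = hashlib.md5(item_key.encode("utf-8")).hexdigest()
    return domains[int(digest, 16) % len(domains)]


# Every read and write for a key must go through the same routing function.
target = domain_for("user:42")
```

Note that with plain modulo hashing, changing the number of domains moves most keys to a different domain, so resharding later means migrating data. That is part of what makes this kind of scaling manual.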

If you need wider query options, have a high read rate and don't have that much data, MongoDB is better. But for durability you need to run at least 2 MongoDB server instances as master/slave. Otherwise you can lose up to the last minute of your data. Scalability is manual. It's much faster than SimpleDB. Autosharding is implemented in version 1.6.

Cassandra has weak query options but is as durable as PostgreSQL. It's as fast as Mongo, and faster at larger data sizes. Write operations are faster than read operations on Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify the config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet. There's no need to shard your data; it was designed to be distributed from day one. You can keep any number of copies of all your data, and if some servers die it automatically returns results from the live ones and redistributes the dead servers' data to the others. It's highly fault tolerant. You can add any number of instances, and it's much easier to scale than the other options. It has strong .NET and Java client options. They have connection pooling, load balancing, marking of dead servers, ...

Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use a relational database or SimpleDB you may also need data caching (e.g. memcached).

For web apps, if your data is small I recommend Mongo; if it's large, Cassandra is much better. You don't need a caching layer with either Mongo or Cassandra; they're already fast. I don't recommend SimpleDB; it also locks you into Amazon, as you said.

If you're using C#, Java or Scala you can write an interface and implement it for Mongo, MySQL, Cassandra or anything else as your data access layer. It's simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write providers for two of them if you want, and even change the storage at runtime with only a configuration change; it's all possible. Development with Mongo, Cassandra and SimpleDB is much simpler than with a relational database, and they're schema-free; it also depends on the client library/connector you're using. The simplest one is Mongo. There's only one index per table in Cassandra, so you have to manage other indexes yourself, but as far as I know secondary indexes will be possible with the 0.7 release of Cassandra. You can also start with any of them and replace it in the future if you have to.
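As a rough sketch of the interface-plus-configuration idea, the Python below defines a small DAO interface, one stand-in implementation, and a factory that picks the backend from a config value. All names (`UserDao`, `InMemoryUserDao`, `make_dao`, the `"storage"` key) are made up for illustration. The in-memory class only stands in for a real MongoDB or SimpleDB implementation, which would wrap whatever client library you use.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserDao(ABC):
    """Storage-agnostic interface that the service layer depends on."""

    @abstractmethod
    def save(self, user_id: str, data: dict) -> None: ...

    @abstractmethod
    def find(self, user_id: str) -> Optional[dict]: ...

    @abstractmethod
    def delete(self, user_id: str) -> None: ...


class InMemoryUserDao(UserDao):
    """Stand-in backend; a MongoDB or SimpleDB class would implement the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def save(self, user_id: str, data: dict) -> None:
        self._rows[user_id] = dict(data)  # store a copy, not the caller's dict

    def find(self, user_id: str) -> Optional[dict]:
        row = self._rows.get(user_id)
        return dict(row) if row is not None else None

    def delete(self, user_id: str) -> None:
        self._rows.pop(user_id, None)


# Registry keyed by a configuration value (hypothetical key and names).
BACKENDS = {"memory": InMemoryUserDao}


def make_dao(config: dict) -> UserDao:
    """Pick the backend from configuration only."""
    return BACKENDS[config["storage"]]()


class UserService:
    """Service code sees only the UserDao interface, never the backend."""

    def __init__(self, dao: UserDao) -> None:
        self._dao = dao

    def rename(self, user_id: str, new_name: str) -> dict:
        user = self._dao.find(user_id) or {}
        user["name"] = new_name
        self._dao.save(user_id, user)
        return user


service = UserService(make_dao({"storage": "memory"}))
```

Switching storage then means registering another class in `BACKENDS` and changing the config value; the service code stays untouched as long as it sticks to these simple operations.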

Regards, Serdar Irmak

sirmak
+1  A: 

I think you have a question of both time and speed.

MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to set up and run server instances for all of them and figure out how they work.

On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.

In the Cassandra / MongoDB fight, here's what you'll find (based on testing I've personally been involved with over the last few days).

Cassandra:

  • Scaling and redundancy are core to the design
  • Configuration can be very intense
  • To do reporting you need map-reduce, and for that you need to run a Hadoop layer. This was a pain to get configured and a bigger pain to get performant.

MongoDB:

  • Configuration is relatively easy (even for the new sharding, this week)
  • Redundancy is still "getting there"
  • Map-reduce is built-in and it's easy to get data out.

Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.

In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
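To illustrate why things beyond simple CRUD are harder to abstract, here is a small sketch assuming a store that only supports lookups by primary key. The `KeyValueStore` and `IndexedUsers` classes are hypothetical, not any real client library. As soon as you want "find users by city", the application has to maintain its own index, and that bookkeeping leaks into your data access code in a way a simple get/put interface never shows.

```python
from collections import defaultdict


class KeyValueStore:
    """Hypothetical store that only supports access by primary key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class IndexedUsers:
    """Wraps the store and keeps a hand-maintained index from city to user ids."""

    def __init__(self, store):
        self._store = store
        self._by_city = defaultdict(set)

    def save(self, user_id, user):
        # Remove the stale index entry first, otherwise old cities keep matching.
        old = self._store.get(user_id)
        if old is not None:
            self._by_city[old["city"]].discard(user_id)
        self._store.put(user_id, dict(user))
        self._by_city[user["city"]].add(user_id)

    def find_by_city(self, city):
        # Answered from the index, since the store itself cannot query by field.
        return sorted(self._by_city.get(city, set()))


users = IndexedUsers(KeyValueStore())
users.save("u1", {"name": "Ada", "city": "Berlin"})
users.save("u2", {"name": "Bob", "city": "Berlin"})
users.save("u1", {"name": "Ada", "city": "Paris"})  # u1 moves, index must follow
```

A store with built-in queries on arbitrary fields would not need the extra class at all, so a common interface either hides that capability or forces this kind of emulation on the weaker backend.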

Gates VP