views:

964

answers:

6

I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billions of records. I have a feeling that I suppose to go with a distributed model and at the moment I looked at

HBase with Hadoop

Couchdb

Problems with HBase solution as I see it -- ruby support is not very strong, and Couchdb did not reach 1.0 version yet.

Do you have suggestion what would you use for such a big amount of data?

Data will require rather fast imports sometimes of 30-40Mb at once, but imports will come in chunks. So ~95% of the time data will be read only.

A: 

What's the data going to look like?

jonnii
Data will require rather fast imports sometimes of 30-40Mb at once, but imports will come in chunks. So ~95% of the time data will be read only.
dimus
Data would look like this:_id "ffffc0559b7441f535b2de431d7bf92a"_rev "1035390285"global_unique_identifier "urn:lsid:catalogueoflife.org:taxon:d77fc2c0-29c1-102b-9a4a-00304854f820:ac2008"rank "Family"scientific_name "Plasmobatidae"
dimus
+1  A: 

There's a number of different solutions people have used. In my experience it really depends more on your usage patterns related to that data and not the sheer number of rows per table.

For example, "How many inserts/updates per second are occurring." Questions like these will play into your decision of what back-end database solution you'll choose.

Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.

Robert Walker
A: 

The backend will depend on the data and how the data will be accessed.

But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.

booch
+1  A: 

Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particular high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to setup (compared to multiple master/write replication).

The big advantage of using a RDBMS with Rails or Merb is you gain access to all of the excellent tool support for accessing these types of databases.

My advice is to actually profile your data in a couple of these systems and take it from there.

Toby Hede
A: 

A word of warning about HBase and other projects of that nature (don't know anything about CouchDB -- I think it's not really a db at all, just a key-value store):

  1. Hbase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
  2. Hbase does not support joins. If you are using ActiveRecord and have more than one relation.. well you can see where this is going.

The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really sql). Point 1 applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.

If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).

Another strategy that works is having a master-slave set up, where all writes are done by the master, and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.

If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.

SquareCog
A: 

I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.

CouchDB will help you a lot when it comes to partitioning/sharding your data and like, seems like it might fit with your project -- especially if your data format might change in the future (adding or removing fields) since CouchDB databases have no schema.

There are plenty of optimizations in CouchDB for read-heavy apps as well and, based on my experience with it, is where it really shines.

thenduks