views: 118
answers: 4

We have a number of these new data management systems, all aiming towards one common goal: making data management as scalable as possible.

By scalability, I mean that the cost of usage should not go up drastically as the size of the data increases.

RDBMSs are slow when the amount of data is large, because the number of indirections invariably increases, leading to more I/Os.

[figure]
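To make that concrete, here is a rough, hypothetical sketch (not tied to any particular RDBMS): a B-tree-style index with fanout f over N rows needs about log_f(N) levels, and each extra level is potentially another disk seek.

    # Rough sketch (illustrative only): how many index levels -- and hence
    # potential disk seeks -- a B-tree-style index needs as the data grows.
    import math

    def btree_levels(rows, fanout=100):
        """Approximate height of a B-tree index with the given fanout."""
        return max(1, math.ceil(math.log(rows, fanout)))

    for rows in (10**4, 10**6, 10**9, 10**12):
        print(f"{rows:>14} rows -> ~{btree_levels(rows)} index levels (seeks per lookup)")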

How do these custom, scalability-friendly data management systems solve the problem?

This is a figure from this document explaining Google BigTable:

[figure: Bigtable's tablet hierarchy]

Looks the same to me. How is the ultra-scalability achieved?

A: 

It's about using cheap commodity hardware to build a network/grid/cloud and spreading the data and load across it (for example using map/reduce).
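A minimal, single-process sketch of the map/reduce idea (my own illustration, standing in for a real cluster): each "node" maps its own chunk of the data independently, and the partial results are merged in a reduce step.

    # Single-process sketch of map/reduce: each chunk is mapped independently,
    # as it would be on a separate cheap node, then the results are reduced.
    from collections import Counter
    from functools import reduce

    chunks = [                      # stands in for data spread across cheap servers
        "big data needs big clusters",
        "clusters of cheap commodity hardware",
        "spread the data and the load",
    ]

    def map_chunk(chunk):
        """Map step: count words in one chunk (runs on one node)."""
        return Counter(chunk.split())

    def reduce_counts(a, b):
        """Reduce step: merge partial counts from two nodes."""
        return a + b

    partials = [map_chunk(c) for c in chunks]        # in a real cluster: in parallel
    total = reduce(reduce_counts, partials, Counter())
    print(total.most_common(3))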

RDBMSs seem to me like software that was (originally) designed to run on one supercomputer. You can use various hard drive arrays and DB clusters, but still...

The amount of data has increased, so there's one more reason to design new data stores with this in mind: scalability, high availability, terabytes of data.

Another thing: if you build a grid/cloud from cheap servers, it's fault tolerant, because you store all data at three (?) different locations, and at the same time it's cheap. A sketch of that placement idea follows below.
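Here is a sketch of that replication idea (a hypothetical placement scheme, not any specific product): each block of data is hashed onto the list of cheap servers and stored on three of them, so losing any single machine doesn't lose the data.

    # Hypothetical placement scheme: store every data block on three different
    # servers, so any single cheap machine can fail without data loss.
    import hashlib

    SERVERS = [f"server-{i}" for i in range(10)]   # the cheap commodity nodes
    REPLICAS = 3

    def place(block_id, servers=SERVERS, replicas=REPLICAS):
        """Pick `replicas` distinct servers for a block, based on its hash."""
        start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(servers)
        return [servers[(start + i) % len(servers)] for i in range(replicas)]

    print(place("user:42/profile"))   # e.g. ['server-7', 'server-8', 'server-9']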

Back to your pictures - the first one is from one computer (typically), the second one from a network of computers.

dwich
A: 

One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.

In fact, just surviving power failures is expensive. Years ago I compared MySQL against Oracle. MySQL was almost unbelievably faster than Oracle, but we couldn't use it. MySQL in those days was built on top of Berkeley DB, which was miles faster than Oracle's full-blown log-based database, but if the power went off while Berkeley DB-based MySQL was running, it was a manual process to get the database consistent again when the power came back on, and you'd probably lose recent updates for good.
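The expensive part is roughly this pattern (a simplified write-ahead-log sketch of my own, not MySQL's or Oracle's actual code): every committed change is appended to a log and forced to disk before the client is told it succeeded, so a power failure can lose at most unacknowledged work.

    # Simplified write-ahead-log sketch (not any real engine's code): the fsync
    # before acknowledging the commit is what makes durability slow but safe.
    import os

    class WriteAheadLog:
        def __init__(self, path):
            self.f = open(path, "ab")

        def commit(self, record: bytes):
            self.f.write(record + b"\n")  # append the change to the log
            self.f.flush()
            os.fsync(self.f.fileno())     # force it to disk *before* acknowledging
            return "committed"            # only now is it safe to tell the client

    wal = WriteAheadLog("demo.wal")
    print(wal.commit(b"UPDATE accounts SET balance = balance - 100 WHERE id = 7"))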

mcdowella
+1  A: 

The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.

This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably, the relational model suits these types of massively scalable applications better than, say, network (graph-based) models. It's just that the SQL language has a lot of disadvantages, and no one has yet come up with suitable relational NoSQL (non-SQL) alternatives.

dportas
+1  A: 

Speaking specifically to your question about Bigtable, the difference is that the hierarchy in the diagram above is all there is. Each Bigtable tablet server is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tablet server is maintained in the memory of the Bigtable master. Looking up a row, or a range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it - resulting in only one, or a few, disk seeks.
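Here is a toy sketch of that lookup path (my own simplification of the scheme described above, not Bigtable's real API): the metadata table maps row ranges to tablets, the master's in-memory map points at the right tablet server, and only the final read touches that server's disk.

    # Toy sketch of the lookup path described above (a simplification, not
    # Bigtable's real API): row -> metadata entry -> tablet server -> row data.
    import bisect

    # Metadata table: sorted (end_row, tablet_id) pairs for contiguous row ranges.
    METADATA = [("g", "tablet-1"), ("n", "tablet-2"), ("z", "tablet-3")]
    # Master's in-memory map: which tablet server is responsible for each tablet.
    TABLET_TO_SERVER = {"tablet-1": "ts-a", "tablet-2": "ts-b", "tablet-3": "ts-c"}

    def lookup(row_key):
        # Step 1: metadata lookup (almost certainly served from memory).
        ends = [end for end, _ in METADATA]
        tablet = METADATA[bisect.bisect_left(ends, row_key)][1]
        # Step 2: the master's map says which tablet server holds that tablet.
        server = TABLET_TO_SERVER[tablet]
        # Step 3: one read (a seek or two at most) on that server for the row.
        return f"read {row_key!r} from {tablet} on {server}"

    print(lookup("mouse"))   # -> read 'mouse' from tablet-2 on ts-b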

In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).

Nick Johnson