views:

1998

answers:

4

New-school datastore paradigms like Google BigTable and Amazon SimpleDB are designed specifically for scalability, among other things. Broadly speaking, they achieve this by disallowing joins and relying on denormalization.

In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and that denormalization is "overrated" to some extent. Why, then, do these systems disallow joins and force everything into a single table in order to scale? Is it the sheer volume of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply at these scales? Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?

+10  A: 

I'm not too familiar with them (I've only read the same blogs/news/examples as everyone else), but my take is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability - I'll try to explain.

Imagine you have 200 rows in your data-table.

In Google's datacenter, 50 of these rows are stored on server A, 50 on server B, and 100 on server C. Additionally, server D holds redundant copies of the data on servers A and B, and server E holds redundant copies of the data on server C.

(In real life I have no idea how many servers would be used, but the system is set up to deal with many millions of rows, so I imagine quite a few.)

To "select * where name = 'orion'", the infrastructure can fire that query to all the servers, and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI this is pretty much what mapreduce is)

This, however, means accepting some tradeoffs.

If you needed to do a relational join on data that was spread across, say, 5 servers, each of those servers would need to pull data from the others for each row. Try doing that when you have 2 million rows spread across 10 servers.

This leads to tradeoff #1 - No joins.
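A toy sketch of why that hurts (the shard layout and names below are hypothetical): joining two "tables" that live on different servers means shipping rows between every pair of shards before any matching can even start.

    # Hypothetical layout: users and orders live on different servers.
    USER_SHARDS = {
        "server1": [{"user_id": 1, "name": "orion"}],
        "server2": [{"user_id": 2, "name": "rik"}],
    }
    ORDER_SHARDS = {
        "server3": [{"user_id": 1, "item": "book"}],
        "server4": [{"user_id": 2, "item": "lamp"}],
    }

    def naive_distributed_join():
        matches = []
        for users in USER_SHARDS.values():
            # In a real cluster this inner loop is a network round-trip per
            # shard pair, and it grows with both the shard and row counts.
            for orders in ORDER_SHARDS.values():
                for user in users:
                    for order in orders:
                        if user["user_id"] == order["user_id"]:
                            matches.append({**user, **order})
        return matches

    print(naive_distributed_join())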

Also, depending on network latency, server load, etc., some of your data may get saved instantly, but some may take a second or two. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' is no longer acceptable.

This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
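A toy illustration of that second tradeoff -- the two dicts, the delay, and the write helper below just mimic asynchronous replication and aren't any real system's API:

    import threading
    import time

    primary_copy = {}   # the server that takes the write
    replica_copy = {}   # another server that gets the data a bit later

    def write(key, value, replication_delay=2.0):
        # Acknowledge as soon as the primary has the data; replicate in the
        # background instead of waiting for the slowest server to finish.
        primary_copy[key] = value
        threading.Timer(replication_delay,
                        replica_copy.__setitem__, (key, value)).start()

    write("name", "orion")
    print(replica_copy.get("name"))   # probably None: not visible here yet
    time.sleep(3)
    print(replica_copy.get("name"))   # 'orion': it shows up eventually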

I'm not sure what other tradeoffs there are, but off the top of my head those are the main 2.

Orion Edwards
+13  A: 

Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got into the game as well with its recent announcement; Microsoft bought its solution in the form of the company that used to be called DataAllegro.)

That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you get from RDBMSs, it is often far easier to denormalize and not do joins -- especially if you don't need to cross-reference much, and especially if you aren't doing ad-hoc analysis but require programmatic access with arbitrary transformations.
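As a rough, made-up illustration of what "denormalize and don't do joins" looks like in practice (the user/order schema here is hypothetical):

    # Normalized: answering "what did orion buy?" needs a join of two tables.
    users = {1: {"name": "orion"}}
    orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "lamp"}]
    bought = [o["item"] for o in orders if users[o["user_id"]]["name"] == "orion"]

    # Denormalized: the same question is one key lookup, at the cost of
    # copying the name into every record and rewriting them all if it changes.
    orders_by_user = {
        "orion": [{"name": "orion", "item": "book"},
                  {"name": "orion", "item": "lamp"}],
    }
    print(bought)
    print(orders_by_user["orion"])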

Denormalization is overrated. Just because that's what happens when you're dealing with 100 terabytes doesn't mean it should be the excuse for every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.

But if you are in the 100-terabyte range, by all means...

Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.

On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
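If you want a feel for that kind of interface, here's a minimal sketch using Python's standard dbm module, which gives you the same BerkeleyDB-flavoured embedded get/put-by-key storage (the keys and values are made up):

    import dbm

    # An embedded key-value store: bytes in, bytes out, looked up by key.
    # No schema, no relations, no joins.
    db = dbm.open("example_store", "c")   # "c" = create the file if needed
    db[b"user:orion"] = b'{"name": "orion", "item": "book"}'
    print(db[b"user:orion"])
    db.close()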

SquareCog
+1  A: 

If you are talking about data that is virtually read-only, the rules change. Denormalisation is hardest in situations where the data changes, because the work required increases and there are more problems with locking. If the data barely changes, denormalisation is not much of a problem.

David Aldridge
+2  A: 

So what I'm getting is that the whole "denormalize, no joins" philosophy exists, not because joins themselves don't scale in large systems, but because they're practically impossible to implement in distributed databases.

This seems pretty reasonable when you're storing largely invariant data of a single type (like Google does). Am I on the right track here?

Rik
You pretty much nailed it. Also throw in 'there is fairly little need to cross-reference data in different "tables"'.
SquareCog