views:

1235

answers:

7

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and using queries that won't/can't use JOINs in their queries (BigTable, HQL, etc) to deal with scalability and sharding databases. How does that work when you have lots and lots of data that is very related?

I can only speculate that much of this joining has to be done on the application side of things, but doesn't that start to get expensive? What if you have to make several queries to several different tables to compile the information you need? Isn't hitting the database that many times more expensive than just using joins in the first place? I guess it depends on how much data you've got?
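To make the scenario concrete, here is a rough sketch of what I imagine an application-side join looks like (the kv_store client and its get_multi() call are made up for illustration, not any real API):

    # Hypothetical application-side "join": two round trips, then stitch the
    # result sets together in application code.
    def posts_with_authors(kv_store, post_ids):
        posts = kv_store.get_multi("post", post_ids)              # query 1
        author_ids = list({p["author_id"] for p in posts})
        authors = kv_store.get_multi("user", author_ids)          # query 2
        authors_by_id = {a["id"]: a for a in authors}
        # The "join" is just a dictionary lookup per row.
        return [dict(p, author=authors_by_id[p["author_id"]]) for p in posts]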

And for commonly available ORMs, how do they tend to deal with the inability to use joins? Is there support for this in ORMs that are in heavy use today? Or do most projects that have to approach this level of data tend to roll their own anyway?

So this is not applicable to any current project I'm doing, but it's something that's been in my head for several months now that I can only speculate as to what "best practices" are. I've never had a need to address this in any of my projects because they've never reached a scale where it is required. Hopefully this question helps other people as well.

As someone said below, ORMs "don't work" without joins. Are there other data access layers that are already available to developers working with data on this level?

EDIT: For some clarification, Vinko Vrsalovic said:

"I believe snicker is wants to talk about NO-SQL, where transactional data is denormalized and used in Hadoop or BigTable or Cassandra schemes."

This is indeed what I'm talking about.

Bonus points for those who catch the xkcd reference.

+11  A: 

A JOIN is a pure relational term and not all databases are relational.

Other database models have other ways to build relations.

Network databases use long chains of "find a key, fetch the reference, find the next key" operations, which have to be programmed in a general-purpose programming language.

The code can be run on the application side or on the server side, but it's not SQL and not even set-based.

If designed properly, a network database can be much faster than a relational one.

For instance, a network database can store a reference to another entity as a direct pointer to an offset in a file or even a block on a disk where the information about this entity is stored.

This makes traversing the network extremely fast, provided you write efficient code to do it.
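To make the direct-pointer idea concrete, here is a minimal sketch of chasing stored file offsets from record to record; the fixed-size record layout is an assumption for illustration, not how any particular network database lays out its files:

    import struct

    RECORD_SIZE = 64               # assumed layout: 8-byte "next" offset + 56-byte payload
    NEXT = struct.Struct("<q")     # little-endian 64-bit file offset, -1 marks the end

    def follow_chain(path, start_offset):
        """Chase direct offset 'pointers' from record to record, network-database style."""
        payloads = []
        with open(path, "rb") as f:
            offset = start_offset
            while offset != -1:
                f.seek(offset)                      # jump straight to the record: no index lookup
                record = f.read(RECORD_SIZE)
                offset = NEXT.unpack_from(record)[0]
                payloads.append(record[NEXT.size:])
        return payloads

Each hop is a single seek, as opposed to the B-Tree descent per reference described below.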

A relational database can only store references as pairs of basic values like integers (or triples or tuples of higher order).

To find those values in the relational database, the engine has to do the following things:

  • Find out where the tuple containing the first value resides
  • Find the second value
  • Find the address of the root in a B-Tree holding the data the second number refers to
  • Traverse this tree
  • Find the pointer to the actual table (which may be stored as a B-Tree itself, in which case the pointer is the value of the PRIMARY KEY of the row we're after)
  • Find the table's row by the pointer or traverse the table
  • Finally, get the result.

And you can control this only to a certain extent. After that, you just issue the SQL query and wait.

The relational model was made to simplify the developer's life, not to achieve maximum speed always and no matter what.

This is the same as assembly vs. higher-level languages, with the relational model being the higher-level language.

You may want to read the article in my blog, in which I try to explain the differences between several commonly used database models.

Quassnoi
+1 for the clarity and the blog entry
Vincent Buck
+15  A: 

You're starting from a faulty assumption.

Data warehousing does not normalize data the same way that a transactional application does. There are not "lots" of joins. There are relatively few.

In particular, second and third normal form violations are not a "problem", since data warehouses are rarely updated. And when they are updated, it's generally only a status flag change to mark a dimension row as "current" vs. "not current".

Since you don't have to worry about updates, you don't decompose things down to the 2NF level where an update can't lead to anomalous relationships. No updates means no anomalies, which means no decomposition and no joins. You can pre-join everything.

Generally, DW data is decomposed according to a star schema. This guides you to decompose the data into numeric "fact" tables that contain the measures -- numbers with units -- and foreign key references to the dimensions.

A dimension (or "business entity") is best thought of as a real-world thing with attributes. Often, this includes things like geography, time, product, customer, etc. These things often have complex hierarchies. The hierarchies are usually arbitrary, defined by various business reporting needs, and are not modeled as separate tables, but simply as columns in the dimension used for aggregation.


To address some of your questions.

"this joining has to be done on the application side of things". Kind of. The data is "pre-joined" prior to being loaded. The dimension data is often a join of relevant source data about that dimension. It's joined and loaded as a relatively flat structure.

It isn't updated. Instead of updates, additional historical records are inserted.

"but doesn't that start to get expensive?". Kind of. It takes some care to get the data loaded. However, there aren't a lot of reporting/analysis joins. The data is pre-joined.

The ORM issues are largely moot since the data is pre-joined. Your ORM maps to the fact or dimension as appropriate. Except in special cases, dimensions tend to be small-ish and fit entirely in memory. The exception is when you're in Finance (Banking or Insurance) or Public Utilities and have massive customer databases. These customer dimensions rarely fit in memory.
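As a rough illustration (the table layout and field names are invented, not from any particular warehouse), a small dimension held in memory lets you "join" and aggregate fact rows with plain dictionary lookups:

    # Dimension: one row per store, with the hierarchy (city, region, country)
    # flattened into columns instead of separate joined tables.
    store_dim = {
        101: {"store": "Main St", "city": "Springfield", "region": "Midwest", "country": "US"},
        102: {"store": "Harbour", "city": "Portsmouth", "region": "South", "country": "UK"},
    }

    # Fact rows: measures plus foreign keys to the dimension.
    sales_facts = [
        {"store_id": 101, "date": "2009-10-01", "units": 3},
        {"store_id": 102, "date": "2009-10-01", "units": 1},
        {"store_id": 101, "date": "2009-10-02", "units": 5},
    ]

    # Aggregate by country without any database-side join.
    units_by_country = {}
    for fact in sales_facts:
        country = store_dim[fact["store_id"]]["country"]   # in-memory dimension lookup
        units_by_country[country] = units_by_country.get(country, 0) + fact["units"]

    print(units_by_country)   # {'US': 8, 'UK': 1}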

S.Lott
I suppose I didn't mean this in a strictly "warehousing" sense but in an application which will be actively reading and writing data (like 80/20 R/W). In a star schema, typically there are lots of joins... what is my faulty assumption?
snicker
@snicker: your faulty assumption is the "lots of joins". Data warehousing can be done with few joins. In many cases, the application-level joins are zero cost because the tables are simple in-memory mappings.
S.Lott
Hmm.. I guess I need to research this a little bit more. What about an application that has many hierarchical relationships? And with the star schema approach, where you would typically use a join to get information from several tables using the fact table.. does that just become several SELECT queries... one for the fact table and one for each join?
snicker
@snicker: hierarchies are not done via "lots of joins". They're simply additional columns in the dimension. The fact-to-dimensions join is the most important kind. When the dimensions are small, they can be in-memory dictionaries, removing all join processing from the database.
S.Lott
So those columns just contain null values for objects lower in the inheritance chain?
snicker
I love your answer but I'm having a hard time following it. Could you add or point to a small comparison example between the two approaches?
Alex Feinman
IMO, given the question edit this is now a non-answer (a very good non-answer, btw). I believe snicker wants to talk about NoSQL (blog.oskarsson.nu/2009/06/…), where transactional data is denormalized and used in Hadoop or BigTable or Cassandra schemes. There is now no mention of data warehousing anywhere in the question.
Vinko Vrsalovic
Vinko... thank you for clarifying... that is exactly what I wanted in the first place..
snicker
@Alex Feinman: Any reference on Data Warehouse Design or Star Schema Design will describe this approach. It's different from transactional normalization, and appears odd at first. Simply Google "Star Schema Design" and you'll find thousands of references.
S.Lott
There's more hand-waving in this answer than you can shake a stick at.
RBarryYoung
@RBarryYoung: It's hard to provide a tutorial on Star Schema design in this format. You can read up on Star Schema design elsewhere and see that it uses very large fact tables and relatively few joins to dimensions. Indeed, it can be flattened into a single monster table because there are no updates.
S.Lott
@S.Lott: You know I honestly cannot figure out what the basis of my objections were, you post looks perfectly fine to me. All I can figure is that I must have posted my comment to the wrong Answer. My sincere apologies. :-(
RBarryYoung
A: 

I think that in these situations you are going to be pretty much on your own and are going to have to roll everything yourself. I've not been there but have considered it for some of our projects. You can get pretty large with relational DBs (as SO demonstrates) so I shall continue to enjoy the relational goodness for now.

Rob West
A: 

Generally, data warehousing is built around using joins and data split into dimensions and fact tables (with so-called "star schemas", etc.).

Joins will often be pre-calculated and stored as de-normalized tables.

I'm not aware of any ORM tools that work with database systems which don't allow joins, as these are not generally seen as traditional relational databases.

Jason
+3  A: 

When you denormalise your data in this manner, you do so to avoid the cost of joining disparate items; you accept that some data may be duplicated and that certain ways of combining it may be difficult, for the performance benefit of using simple queries.

If you're having to do any great amount of joining at the application level, it implies that you haven't denormalised it enough.

Ideally, you'll be able to make one query for any set of data you want. In practice, you shouldn't have to use more than two or three queries for any aspect of your application, and any application-level joining will be more of a trivial retrieval of stuff from the separate resultsets for insertion into the view.
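For instance (the datastore and template objects here are stand-ins for whatever client and templating layer you actually use), a page might be assembled like this:

    def profile_page(datastore, template, user_id):
        # Each piece of the page comes from its own denormalized record or list;
        # nothing is joined, the result sets are simply handed to the view together.
        profile = datastore.get("user_profile", user_id)                    # query 1
        recent_posts = datastore.query("post_by_user", user_id, limit=10)   # query 2
        return template.render("profile.html", profile=profile, posts=recent_posts)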

This kind of thing is only really needed for truly massive datasets, and there are all kinds of tradeoffs involved. To give just one example: BigTable can't do aggregate queries, such as giving you a count. It can be used to give you a figure that's roughly accurate - in the sense that if you have, say, 12,149,173 records of which 23,721 were added in the last hour, it doesn't really matter if the best you can find out is that you have "about 12,100,000 records". If your application depends on knowing the precise figure at any given moment, then you shouldn't be using BigTable for it, is the general attitude.

NickFitz
+1  A: 

Applications like Facebook have very few data changes; most of the time users are posting new items. So the fact that multiple records need updating when an item is changed is a lesser problem.

This allows the data to be left unnormalized without hitting the common problems with updates.

Applications like Amazon can afford to load all the data for a single user into RAM (how big is a shopping cart after all?), then update the data in RAM and write it out as a single data item.

Once again, this removes the need to keep most of the data normalized.
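A minimal sketch of that pattern, assuming a generic key-value client with get()/put() (not Amazon's actual implementation):

    import json

    def add_to_cart(kv_store, user_id, product_id, quantity=1):
        key = "cart:%s" % user_id
        raw = kv_store.get(key)
        cart = json.loads(raw) if raw else {"user_id": user_id, "items": []}

        # Update entirely in memory; there is no normalized cart_items table to join.
        for item in cart["items"]:
            if item["product_id"] == product_id:
                item["quantity"] += quantity
                break
        else:
            cart["items"].append({"product_id": product_id, "quantity": quantity})

        kv_store.put(key, json.dumps(cart))   # the whole cart is written back as one item
        return cart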

You are trading ease of application development for scaling, so if you don't need to scale to great heights you may wish to keep the ease of application development that an RDBMS provides.

Ian Ringrose
+12  A: 

The way I look at it, a relational database is a general-purpose tool to hedge your bets. Modern computers are fast enough, and RDBMSs are well-optimized enough, that you can grow to quite a respectable size on a single box. By choosing an RDBMS you are giving yourself very flexible access to your data, and the ability to have powerful correctness constraints that make it much easier to code against the data. However, the RDBMS is not going to represent a good optimization for any particular problem; it just gives you the flexibility to change problems easily.

If you start growing rapidly and realize you are going to have to scale beyond the size of a single DB server, you suddenly have much harder choices to make. You will need to start identifying bottlenecks and removing them. The RDBMS is going to be one nasty snarled knot of codependency that you'll have to tease apart. The more interconnected your data, the more work you'll have to do, but maybe you won't have to completely disentangle the whole thing. If you're read-heavy, maybe you can get by with simple replication. If you're saturating your market and growth is leveling off, maybe you can partially denormalize and shard to a fixed number of DB servers. Maybe you just have a handful of problem tables that can be moved to a more scalable data store. Maybe your usage profile is very cache-friendly and you can just migrate the load to a giant memcached cluster.

Where scalable key-value stores like BigTable come in is when none of the above can work, and you have so much data of a single type that even when it's denormalized a single table is too much for one server. At this point you need to be able to partition it arbitrarily and still have a clean API to access it. Naturally when the data is spread out across so many machines you can't have algorithms that require these machines to talk to each other much, which many of the standard relational algorithms would require. As you suggest, these distributed querying algorithms have the potential to require more total processing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized the real time performance is orders of magnitude better than any single machine could do (assuming a machine that could hold the entire index even exists).
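As a very rough sketch of what "partition it arbitrarily and still have a clean API" can mean (the per-shard client interface is assumed for illustration), keys can be hashed to pick a server:

    import hashlib

    def shard_for(key, num_shards):
        # Map a key to one of N partitions; any machine can locate a record
        # without asking the others.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_shards

    class ShardedStore(object):
        def __init__(self, shard_clients):
            self.shards = shard_clients        # one client per partition/server

        def get(self, key):
            return self.shards[shard_for(key, len(self.shards))].get(key)

        def put(self, key, value):
            self.shards[shard_for(key, len(self.shards))].put(key, value)

Real systems tend to use consistent hashing or range partitioning instead of a plain modulo, so that adding a server doesn't reshuffle most of the keys.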

Now once you can scale your massive data set horizontally (by just plugging in more servers), the hard part of scalability is done. Well, I shouldn't say done, because ongoing operations and development at this scale are a lot harder than with a single-server app, but the point is that application servers are typically trivial to scale via a shared-nothing architecture as long as they can get the data they need in a timely fashion.

To answer your question about how commonly used ORMs handle the inability to use JOINs, the short answer is they don't. ORM stands for Object-Relational Mapping, and most of the job of an ORM is just translating the powerful relational paradigm of predicate logic into simple object-oriented data structures. Most of the value they give you is simply not going to be possible with a key-value store. In practice you will probably need to build up and maintain your own data-access layer that's suited to your particular needs, because data profiles at these scales vary dramatically and I believe there are too many tradeoffs for a general-purpose tool to emerge and become dominant the way RDBMSs have. In short, you'll always have to do more legwork at this scale.

That said, it will definitely be interesting to see what kind of relational or other aggregate functionality can be built on top of the key-value store primitives. I don't really have enough experience here to comment specifically, but there is a lot of knowledge in enterprise computing about this going back many years (eg. Oracle), a lot of untapped theoretical knowledge in academia, a lot of practical knowledge at Google, Amazon, Facebook, et al, but the knowledge that has filtered out into the wider development community is still fairly limited.

However now that a lot of applications are moving to the web, and more and more of the world's population is online, inevitably more and more applications will have to scale, and best practices will begin to crystallize. The knowledge gap will be whittled down from both sides by cloud services like AppEngine and EC2, as well as open source databases like Cassandra. In some sense this goes hand in hand with parallel and asynchronous computation which is also in its infancy. Definitely a fascinating time to be a programmer.

dasil003
Thank you. Incredibly useful info.
snicker
Here's some interesting technical information about real-world distributed systems at Google: http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx
dasil003
+1. Just stumbled across this answer. I have always felt that relational is a good starting point. Avoid premature optimization, etc. However, if you consider Google App Engine, they use an object-oriented data model (not strictly an ORM though) which is ultimately mapped to a k/v store.
jhs