views:

749

answers:

7

I've been reading a lot lately about how joins in DB queries slow things down. Evidently Google App Engine doesn't even allow them.

I'm wondering how people design an app with no joins though. For example I'm working on an app that has contacts and organizations. A contact can be in many organizations and an organization can have many contacts. How would it be possible to have that relationship without a third table that connects the two entities...

contacts --< contacts_organizations >-- organizations

Does it mean that in GAE you can't have a many-to-many relationship? You just leave out functionality that would require a join?

I guess you could have a TEXT organizations column in the contacts table containing a space-separated list of the organization IDs for each contact. That seems a little weird though.

+3  A: 

You use the db.ReferenceProperty to link objects, see Google App Engine: One-to-many JOIN for details and examples.

Robert Gamble
+13  A: 

It's a myth that joins slow down software, in the same way that it would be a myth to assert writing loops in application code slow down software.

I mean, why write a loop? That just runs the same lines of code again and again! Wasn't once enough? It's a tremendous waste!

The above statements are intended to be ironic.

My point is that a query contains a join for a purpose: to get the right answer. Using joins inefficiently or needlessly is of course poor design, like putting loop-invariant code inside a loop.

Avoiding joins as a general policy is an example of premature optimization. If your approach to writing efficient code is to come up with blanket rules like that, then avoiding joins is not going to help you.


As for Google App Engine, it does support relationships between entities, but since it isn't strictly a relational database model, the concept of a join doesn't really come up. Instead, you can get related entities from a given reference, which is more like an ORM interface to a model, it isn't the same thing as a join in SQL.

You can read more here: http://code.google.com/appengine/articles/modeling.html

(that link was in another answer on this thread, but it got deleted)

Bill Karwin
premature optimization is embedded in most of what is considered best practices
Orentet
That's neither true, nor a justification for employing premature optimization.
Bill Karwin
What about the GAE?
Michael Haren
See amended answer above.
Bill Karwin
'premature optimization is the root of all evil' --Donald Knuth
Gary Willoughby
By definition optimisation can only be "premature" when it's done too early. If optimising a process at a particular stage of design and development is correct then it is not premature. Debate whether a practice represents premature optimisation, but not whether prematurity is good or bad.
David Aldridge
@David: Right, my point is that the practice of making a uniform policy against joins is definitely premature.
Bill Karwin
+1  A: 

I think Google is ripping you of some computation-heavy mechanism so you will look for ways that'll utilize more of other kinds of resources, for example hard disks maintaining reference tables and/or counting tables instead of CPU cycles wasting for joins and aggregate calculation.

And it isn't impossible, you just need to workaround it using other kinds of resources to aid you.

chakrit
+7  A: 

Nit-picking point: Google doesn't disallow JOINs in their database to prevent users from running "expensive" queries; the database isn't relational, so the "JOIN" SQL verb isn't really applicable in the first place.

In this way, BigTable is the same as Amazon's SimpleDB - data is denormalised and stripped of schemas so that you effectively end up with huge, efficient hash tables with arbitrary data allowed in the buckets.

These hash tables are very, very easy to scale, especially compared with relational databases. For applications like GAE, extreme scalability is a higher priority than a full feature-set.

Alabaster Codify
Not nit-picky, but right on the whole point of the reference being quoted. For any standard rdbms, the quoted advice is nonsense.
le dorfier
+3  A: 

Usually when you are talking about databases not allowing joins, you are talking about very large databases that don't necessarily fit on one server. The recent examples being the cloud databases like Amazon's SimpleDB, Microsoft's SQL Data Services, and Google's App Engine Datastore. Some offer limited join capability, but the big difficulty is doing joins across "partitions". In large databases like this, you partition your data so it doesn't have to reside on the same server. You have to decide the right way to partition it.

In your example, I would store a list of organization keys in a field in the contacts table, and vice versa. The design of these databases is different than your typical normalized database. The tables are usually "sparse tables", which basically means each record can have any number of fields which are basically name/value pairs. Think of a products table on Amazon, and how many different fields there could be for different types of products. Books have number of pages, but MP3s have duration. In a sparse table, these records would be stored in the same table.

Lance Fisher
+1  A: 

You can perform joins in your application instead of the DB server, by fetching results from each table separately and then combining them, but for most joins doing this will only slow you down due to the latency of making several database round-trips instead of just one.

But: the honest truth is that joins are not your problem. By the time they are, if ever, you won't even need to ask this question. You can count the number of real life projects that get to this point on your fingers(primarily Ebay), and there's no evidence that eliminating joins entirely was the only way these projects could have been made to scale.

Seun Osewa
A: 

The databases you mention are, at best, versioned record stores designed to store very large volumes of data across multiple servers. To call them a 'database' would be a stretch. The do not support joins, nor ACID transactions, rollbacks, etc. You can write applications without them but will often have to do more work to provide the functionality.

For:

contacts --< contacts_organizations >-- organizations

You could deNormailize and store organizations in contacts and contacts in orgainizations. But you will have to enforce referential integrity in the application deal with simultaneous updated to both tables.

A better solution would be to store the data in three tables and do the 'joins' yourself.