Flickr's Cal Henderson gave the keynote address for DjangoCon 2008. He briefly touched on the Django framework's use of table joins in queries, asserting, "you don't use joins when you get to a certain scale." I expect that Henderson knows this stuff to the bone, but does anyone know what the likely reasoning for his claim is?
At a certain level of performance, you care a great deal about how many times you need to move a disk head to satisfy a query. To put together two records by way of a JOIN, you need to move the disk heads at least twice, unless one or both records are held completely in an index, and the index is cached. (Adding columns to an index so that the columns required to satisfy a query come out of the index is a common technique, but the wider your index tuples, the fewer you can cache.) And eventually you hit a scale where the records you want to join aren't under control of the same database instance.
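A minimal sketch of that covering-index idea, using Python's stdlib sqlite3 and a hypothetical messages table (the same technique applies to MySQL or Postgres):

```python
# Covering index sketch: user_id and posted_at both live in the index, so the
# query below can be answered from index pages alone, without touching the
# table rows (and without a second set of seeks).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "posted_at TEXT, body TEXT)"
)
conn.execute(
    "CREATE INDEX idx_messages_user_posted ON messages (user_id, posted_at)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT posted_at FROM messages WHERE user_id = ?", (42,)
).fetchall()
print(plan)  # SQLite should report a search USING COVERING INDEX here
```

The trade-off is exactly the one above: every extra column makes the index tuples wider, so fewer of them fit in cache.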
Joins are a cost. You still join or group the data together and pay for it but you defer the cost to the cheaper application tier where it's easier to scale.
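As a rough sketch of what that looks like in Python (table and column names are hypothetical, and this assumes existing orders and users tables):

```python
# Application-tier "join": fetch the primary rows, fetch the related rows by
# key in one extra query, then stitch them together in a dict in memory.
import sqlite3

conn = sqlite3.connect("app.db")
conn.row_factory = sqlite3.Row

orders = conn.execute("SELECT id, user_id, total FROM orders LIMIT 50").fetchall()

users_by_id = {}
user_ids = {o["user_id"] for o in orders}
if user_ids:
    placeholders = ",".join("?" * len(user_ids))
    for u in conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})",
        tuple(user_ids),
    ):
        users_by_id[u["id"]] = u

# The "joined" rows, assembled on the app tier instead of in the database.
report = [(o["id"], users_by_id[o["user_id"]]["name"], o["total"]) for o in orders]
```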
All large scalable systems have to do without joins. The reason is that in highly distributed databases such as Google's BigTable, the data is so large that it extends beyond a single machine. Joining two tables that are gigabytes in size is not scalable in any way; in fact, if you do lots of joins, you will find that somewhere around 5 million rows your RDBMS starts to clunk along, leaning heavily on indexes. And indexes are also much harder to use effectively in distributed databases and document databases like MongoDB, CouchDB, etc.
The future is a good architectural model as the base, then copies of the data, with after-insert/update queues that build flattened join tables and update them as each set of rows changes. Large RDBMS deployments in MSSQL, Oracle, etc. all get to the point where data warehousing and data flattening are needed for reporting speed and for highly scalable needs like the web.
When we get to terabytes of data, the join will be a thing of the past.
I think that is a gross generalization. Relational database concepts, joins included, are some of the most useful and valuable tools available to the modern application programmer.
Concepts such as denormalization do have their merits for massive datasets. But these days we tend to take the word of large web application developers (vide Facebook, MySpace, etc.) as gospel, without thinking about the context.
A well optimized join, with indices and foreign keys, will be fine unless you grow beyond the 300-400 million row mark (I can't speak for larger, as that is about the limit of where we start archiving on the biggest application I work on now).
I tend to disagree, because if you design your database well, you can get the performance. We have a multi-terabyte data warehouse modeled as a Kimball star schema; you have to JOIN the facts to the dimensions to do any kind of analysis, and it performs (because it's partitioned and indexed). But there I have to produce 200m rows of summary output in a single process, and that amount of information is simply never going to be pushed at a user.
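For concreteness, a fact-to-dimension query of that kind might look like the following (a hedged sketch against a hypothetical star schema, run from Python):

```python
# Star-schema join sketch: the sales fact table is joined to its date and
# product dimensions and aggregated. On a real warehouse the fact table would
# be partitioned and indexed on the dimension keys, which is what lets this
# kind of join perform at multi-terabyte scale.
import sqlite3

conn = sqlite3.connect("warehouse.db")
summary = conn.execute(
    """
    SELECT d.year, p.category, SUM(f.amount) AS revenue
    FROM sales_fact AS f
    JOIN date_dim    AS d ON d.date_key    = f.date_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
    """
).fetchall()
```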
However, for typical client-facing web applications that return a limited amount of data with each page generation, how much do you really join? Instead, your application server could request the rows, then request the related rows, and so on. Back when a relational database wasn't available (on a 64K small-model 8086 handheld programmed in C), we had an ISAM library: we would seek and read in one table, then seek and read in another. If you aren't dealing with a lot of data, it's just as easy to do the work yourself like that.
But it is more programming, and more code means more bugs. It also means a much weaker model for database security and constraint/relational integrity. Without the JOIN, you'll make more trips to the database; in the best case, you'll be returning the same amount of information from the database server to the web server. It can get worse if the web server is expected to filter the rows against previous rows it has retrieved. Effectively, the web server is still doing the JOIN, but of course it's a little easier to scale out web servers, and it requires less expertise in relational-engine optimization.
As you scale up, you start throwing out capabilities because they cost something. Subqueries first; then eventually even joins. That'll leave you doing whatever you need with tables and indexes - like Google.
SQL databases are typically built on ISAMs, which are nothing but tables and indexes, so he's just saying he's getting closer to the metal. (Which, come to think of it, is MyISAM, I guess.) You save the optimizer the trouble of figuring it out for you, and I'm sure you could go on from there. But the first step would, IMHO, be to shed the overhead of the SQL parser/optimizer and manipulate the tables and indexes directly, like it used to be in FoxPro, etc.
I'm somewhat exaggerating when I say they're evil.
For very large data sets, even when they fit within a single database, joining is an expensive operation (lots of non-sequential IO). With a typical web-app load (90/10 read/write), your reads need to be as cheap as possible, while you can spend more time on writes (and lazily replicate writes in many cases). In a typical high-performance web-app, you're going to want to perform all database IO within a couple of hundred milliseconds, so that's your first limit. Secondly, you want to be able to do plenty of concurrent requests. This tends to point to being able to collect records straight from index for large tables. Someone already mentioned that you don't need to send a ton of data to the browser, so performing the join across the whole dataset isn't needed, but consider ordering: if you can't get the records in the correct order straight from index, you're going to need to perform the entire join before ordering the results.
For data partitioned across multiple machines, the same problems apply, just at a larger scale. The usual solution is materialized views (data flattening): you perform multiple writes at insert/update/delete time (or lazily afterwards) so that join-like queries become very simple indexed selects.
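As a rough illustration of that write-time flattening (table names hypothetical; in practice the denormalized write would often go through a queue rather than the request path):

```python
# Data-flattening sketch: every write to the normalized tables also updates a
# denormalized "comments_flat" table that already carries the joined columns,
# so the read path is a single indexed select with no join.
import sqlite3

def add_comment(conn, comment_id, post_id, author_id, body):
    # Normalized, source-of-truth write.
    conn.execute(
        "INSERT INTO comments (id, post_id, author_id, body) VALUES (?, ?, ?, ?)",
        (comment_id, post_id, author_id, body),
    )
    # Denormalized copy for reads: author name and post title are duplicated
    # into the flat row so the comment page never joins at request time.
    author_name = conn.execute(
        "SELECT name FROM users WHERE id = ?", (author_id,)
    ).fetchone()[0]
    post_title = conn.execute(
        "SELECT title FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO comments_flat (id, post_id, post_title, author_name, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (comment_id, post_id, post_title, author_name, body),
    )
    conn.commit()
```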
Joins are obviously useful and perfectly good most of the time. But on large datasets, in a database that doesn't natively support materialized views, this approach falls down at high concurrency.
And the specific complaint about Django is that because of the inflexibility in changing models on existing data, people are encouraged to create 1-to-1 mapped tables which are only ever joined against, rather than adding columns to existing tables.
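A sketch of the pattern being described, with hypothetical models (this assumes a configured Django project):

```python
# Instead of adding location/birthday columns to Profile (which would mean
# changing a model that already has data), the new fields land in a separate
# model that is only ever read via a 1-to-1 join back to Profile.
from django.db import models

class Profile(models.Model):
    name = models.CharField(max_length=100)

class ProfileExtra(models.Model):
    profile = models.OneToOneField(Profile, on_delete=models.CASCADE)
    location = models.CharField(max_length=100, blank=True)
    birthday = models.DateField(null=True, blank=True)
```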