views:

934

answers:

3

From Joel Spolsky's article on leaky abstractions:

[C]ertain SQL queries are thousands of times slower than other logically equivalent queries. A famous example of this is that some SQL servers are dramatically faster if you specify "where a=b and b=c and a=c" than if you only specify "where a=b and b=c" even though the result set is the same.

Does anyone know the details of this?
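
For concreteness, here are the two forms from the quote written out as complete statements; the table name is made up, since the quote only gives the WHERE clauses:

    -- Hypothetical table t with columns a, b and c; the two queries are
    -- logically equivalent, the second just spells out the redundant a = c.
    SELECT * FROM t WHERE a = b AND b = c;
    SELECT * FROM t WHERE a = b AND b = c AND a = c;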

+1  A: 

I think the "certain" word is the operative term here. In order for the optimizer to really understand that a=c, it would have to parse and then connect the a's equality all of the way to "c" in a transitive relationship in order to deduce the relationship.

I think that in the future, SQL optimizers might get this smart (if they aren't already), so IMO this isn't really a general statement on Joel's part.

Dave Markle
I read "certain" as applying to the universe of SQL queries, not the universe of SQL queries containing WHERE clauses of the form "WHERE a=b AND b=c."
Jason
Oracle is currently (10g, maybe earlier) smart enough to apply transitive closure rules.
WW
+13  A: 

Obviously, a = b and b = c => a = c - this is just transitivity (the idea behind transitive closure). The point Joel was making is that some SQL servers are poor at optimizing queries, so some SQL queries have to be written with "extra" qualifiers, as in the example.

In this example, remember that a, b and c as above often refer to columns of different tables, and conditions like a=b are evaluated as joins. Suppose table a has 1000 rows, b has 500 and c has 20. Then joining a and b needs about 1000x500 row comparisons (this is my dumb example; in practice there are much better join algorithms that reduce the cost a lot), while joining b and c needs 500x20. A query optimizer will decide that the join of b and c should be performed first and that the result should then be joined on a = b, since b=c is expected to produce fewer rows. In this rough model that is about 500x20 + 500x1000 comparisons for (b=c) and (a=b) respectively. After that, intersections have to be computed between the returned rows (I guess also via joins, but I'm not sure).

Suppose the SQL server had a logical inference module that could also infer that this means a = c. Then it would probably perform the join of b and c and then the join of a and c (again, this is a hypothetical case). That would take about 500x20 + 1000x20 comparisons, plus the intersection computations afterwards. If the expected number of rows with a = c is smaller (due to some domain knowledge), the second plan will be a lot faster.
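
To make the two hypothetical plans concrete, here is a sketch with explicit join syntax; the schema (a single join column x) is invented, and the row counts are the ones assumed above:

    -- Assumed sizes: a ~1000 rows, b ~500, c ~20, all joined on column x.
    -- Without the inferred a = c, the planner joins b to c (500 x 20) and then
    -- joins that result to a via a = b (roughly 1000 x 500 in this naive model):
    SELECT *
    FROM   c
    JOIN   b ON b.x = c.x
    JOIN   a ON a.x = b.x;

    -- If it can infer (or is told) a = c, it can join the (b, c) result to a
    -- via a = c instead (roughly 1000 x 20 in the same model):
    SELECT *
    FROM   c
    JOIN   b ON b.x = c.x
    JOIN   a ON a.x = c.x;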

Overall my answer has become too long, but the point is that SQL query optimization is not a trivial task, and that is why some SQL servers may not do it very well.

More can be found at http://en.wikipedia.org/wiki/Query_optimizer or from some expert on databases reading this.

But philosophically speaking, SQL (as an abstraction) was meant to hide all aspects of the implementation. It was meant to be declarative (a SQL server can itself use query optimization techniques to rephrase a query and make it more efficient). In the real world it is not so - often database queries have to be rewritten by humans to make them more efficient.

Overall, the point of the article is that an abstraction can only be so good, and no abstraction is perfect.

Amit Kumar
Thank you for elaborating on the point of the article and the point of the example. But I was looking for the details behind the example. More explicitly, what caused the issue, and how do modern query optimizers get around the issue?
Jason
Thank you for elaborating. Very helpful.
Jason
+8  A: 

Here's a simpler explanation, where everything is in one table.

Suppose A and C are both indexed, but B is not. If the optimizer can't realize that A = C, then it has to use the non-indexed B for both WHERE conditions.

But if you then tell the server that a=c, it can efficiently apply that filter first and greatly reduce the size of the working set.
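
A minimal sketch of that scenario; the table, column and index names are made up:

    -- Hypothetical single table: columns a and c are indexed, b is not.
    CREATE TABLE t (a INT, b INT, c INT);
    CREATE INDEX idx_t_a ON t (a);
    CREATE INDEX idx_t_c ON t (c);

    -- Both conditions here involve the unindexed column b:
    SELECT * FROM t WHERE a = b AND b = c;

    -- Spelling out a = c adds a condition that touches only the indexed
    -- columns, which (per the reasoning above) a naive optimizer can use
    -- to narrow the working set before evaluating the conditions on b:
    SELECT * FROM t WHERE a = b AND b = c AND a = c;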

Joel Coehoorn
Very intuitive. Thanks.
Jason