Is there a generalized procedure or algorithm for transforming a SQL subquery into a join, or vice versa? That is, is there a set of typographic operations that can be applied to a syntactically correct SQL query statement containing a subquery that results in a functionally equivalent statement without a subquery? If so, what are they (i.e., what's the algorithm), and in what cases do they not apply?
At a really high level. to transform a sub-query to a JOIN:
- FROM: Table Names go into FROM
- JOIN The parts of the WHERE clause with table names on both sides determine (a) the type of JOIN (b) the condition of join
- WHERE The parts of the where clause without table names on both sides go into the WHERE clause
- SELECT Column names from Sub-Query go into the SELECT
Transforming a JOIN to Sub-Query entails the reverse of the above logic
In SQL Server, at least, the optimizer can do this at will, but I'm sure that there are constraints on when it does it. I'm sure that it was probably someone's PhD thesis to be able to do it in the computer.
When I do it the old fashioned human way, it's fairly straightforward - particularly if the subquery is already aliased - it can be pulled into a Common Table Expression first.
Converting a subquery into a JOIN can be pretty straightforward:
IN
clause
FROM TABLE_X x
WHERE x.col IN (SELECT y.col FROM TABLE_Y y)
...can be converted to:
FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col
Your JOIN criteria is where you have direct comparison.
EXISTS
clause
But there are complications when you look at the EXISTS
clause. EXISTS are typically correllated, where the subquery is filtered by criteria from the table(s) outside the subquery. But the EXISTS is only for returning a boolean based on the criteria.
FROM TABLE_X x
WHERE EXISTS (SELECT NULL
FROM TABLE_Y y
WHERE y.col = x.col)
...converted:
FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col
Because of the boolean, there's a risk of more rows turning up in the resultset.
SELECT
s in the SELECT clause
These should always be changed, with prejudice:
SELECT x.*,
(SELECT MAX(y.example_col)
FROM TABLE_Y y
WHERE y.col = x.col)
FROM TABLE_X x
You're probably noticing a patter now, but I made this a little different for an inline view example:
SELECT x.*,
z.mc
FROM TABLE_X x
JOIN (SELECT y.col, --inline view within the brackets
MAX(y.example_col) 'mc'
FROM TABLE_Y y
GROUP BY y.col) z ON z.col = x.col
The key is making sure the inline view resultset includes the column(s) needed to join to, along with the columns.
LEFT JOIN
s
You might've noticed I didn't have any LEFT JOIN examples - this would only be necessary if columns from the subquery use NULL testing (COALESCE
on almost any db these days, Oracle's NVL
or NVL2
, MySQLs IFNULL
, SQL Server's ISNULL
, etc...):
SELECT x.*,
COALESCE((SELECT MAX(y.example_col)
FROM TABLE_Y y
WHERE y.col = x.col), 0)
FROM TABLE_X x
Converted:
SELECT x.*,
COALESCE(z.mc, 0)
FROM TABLE_X x
LEFT JOIN (SELECT y.col,
MAX(y.example_col) 'mc'
FROM TABLE_Y y
GROUP BY y.col) z ON z.col = x.col
Conclusion
I'm not sure if that will satisfy your typographic needs, but hope I've demonstrated that the key is determining what the JOIN criteria is. Once you know the column(s) involved, you know the table(s) involved.
This rates a strong "it depends".
At one level, if you're talking about queries compatible with ANSI SQL 89 or 92*, then I would guess it's a definite maybe. If you have simple (or even not so simple) queries consisting of "basic" select, from, and where clauses, then yes, I would like to think that it is mathematically possible to define processes and procedures to create and "uncreate" subqueries (though how you might determine when to algorithmically form a subquery is beyond me). I think this "rationale" could be applied to outer joins and correlated subqueries.
At another level, I'd say "no way". Most of the time I write a subquery, it's because I can't think of a way to wedge it into the "main" query. Very rarely this involves correlated subqueries, but more often than not in involves what are, I'm pretty darn sure, proprietary extensions to the standards. How could you account for pivots, unpivots, ranking functions, TOP N clauses (which may well be ANSI standards, I'll admit to never having read them cover to cover), FULL or OUTER APPLY, and the like? And that's just parts of SQL Server, I'm sure Oracle, DB2, MYSQL, and most every other player has their own extensions that break the "purist" relational model.
Of course, they say it is impossible to prove a negative. I'd summarize with "can't be done until proven otherwise", leave the proof to the academics and theoreticians, and point out that even then, whatever system you purchase won't support it unless it makes financial sense for the manufacturer to work it in. (Does any system support OUTER UNION yet?)
** A bit of googling failed to produce any references to a third ANSI SQL standard. I know I heard talk about it years ago, did it ever happen?*
A fully automatic system for transforming queries from sub-queries into joins would be relatively difficulty to build. You would need to take an input query, parse it into a parse tree and then perform some fairly complex pattern matches on the parse tree - replacing sections of the tree with new sections of the parse tree. At the end you do a traversal of the tree to output the new query.
There can be some amazingly good or bad performance repercussions. Sometimes a sub-query is much faster than a join. Sometimes it is the inverse.
It often is possible, and what's good is that the query optimizer can do it automatically, so you don't have to care about it.
This question relies on a basic knowledge of Relational Algebra. You need to ask yourself what kind of join is being performed. For example, an LEFT ANTI SEMI JOIN is like a WHERE NOT EXISTS clause.
Some joins do not allow duplicating of data, some do not allow eliminating data. Others allow extra fields to be available. I discuss this in my blog at http://msmvps.com/blogs/robfarley/archive/2008/11/09/join-simplification-in-sql-server.aspx
Also, please don't feel you need to do everything in JOINs. The Query Optimizer should take care of all of this for you, and you can often make your queries much harder to maintain this way. You may find yourself using an extensive GROUP BY clause, and having interesting WHERE .. IS NULL filters, which will only serve to disconnect the business logic from the query design.
A subquery in the SELECT clause (essentially a lookup) only provides an extra field, not duplication or elimination. Therefore, you would need to make sure that you enforce GROUP BY or DISTINCT values in your JOIN, and use an OUTER JOIN to guarantee that behaviour is the same.
A subquery in the WHERE clause can never duplicate data, or provide extra columns to the SELECT clause, so you should use GROUP BY / DISTINCT to check this. WHERE EXISTS is similar. (This the LEFT SEMI JOIN)
WHERE NOT EXISTS (LEFT ANTI SEMI JOIN) doesn't provide data, and doesn't duplicate rows, but can eliminate... for this you need to do LEFT JOINs and look for NULLs.
But the Query Optimizer should handle all this for you. I actually like having occasional subqueries in the SELECT clause, because it makes it very clear that I am not duplicating or eliminating rows. The QO can tidy it for me, but if I'm using a view or inline table-valued function, I want to make it clear to those who come after me that the QO can simplify it down a lot. Have a look at the Execution Plans of your original query, and you'll see that the system is providing the INNER/OUTER/SEMI joins for you.
The thing you really need to be avoiding (at least in SQL Server) are functions that use BEGIN and END (such as Scalar Functions). They may feel like they simplify your code, but they will actually be executed in a separate context, as the system sees them as procedural (not simplifiable).
I did a session on this kind of thing at the recent SQLBits V conference. It was recorded, so you should be able to watch it at some point (if you can put up with my jokes!)