views:

667

answers:

7

Is there a generalized procedure or algorithm for transforming a SQL subquery into a join, or vice versa? That is, is there a set of typographic operations that can be applied to a syntactically correct SQL query statement containing a subquery that results in a functionally equivalent statement without a subquery? If so, what are they (i.e., what's the algorithm), and in what cases do they not apply?

+2  A: 

At a really high level. to transform a sub-query to a JOIN:

  1. FROM: Table Names go into FROM
  2. JOIN The parts of the WHERE clause with table names on both sides determine (a) the type of JOIN (b) the condition of join
  3. WHERE The parts of the where clause without table names on both sides go into the WHERE clause
  4. SELECT Column names from Sub-Query go into the SELECT

Transforming a JOIN to Sub-Query entails the reverse of the above logic

Raj More
+1  A: 

In SQL Server, at least, the optimizer can do this at will, but I'm sure that there are constraints on when it does it. I'm sure that it was probably someone's PhD thesis to be able to do it in the computer.

When I do it the old fashioned human way, it's fairly straightforward - particularly if the subquery is already aliased - it can be pulled into a Common Table Expression first.

Cade Roux
+6  A: 

Converting a subquery into a JOIN can be pretty straightforward:

IN clause

 FROM TABLE_X x
WHERE x.col IN (SELECT y.col FROM TABLE_Y y)

...can be converted to:

FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col

Your JOIN criteria is where you have direct comparison.

EXISTS clause

But there are complications when you look at the EXISTS clause. EXISTS are typically correllated, where the subquery is filtered by criteria from the table(s) outside the subquery. But the EXISTS is only for returning a boolean based on the criteria.

 FROM TABLE_X x
WHERE EXISTS (SELECT NULL
                FROM TABLE_Y y
               WHERE y.col = x.col)

...converted:

FROM TABLE_X x
JOIN TABLE_Y y ON y.col = x.col

Because of the boolean, there's a risk of more rows turning up in the resultset.

SELECTs in the SELECT clause

These should always be changed, with prejudice:

SELECT x.*,
       (SELECT MAX(y.example_col)
          FROM TABLE_Y y
         WHERE y.col = x.col)
  FROM TABLE_X x

You're probably noticing a patter now, but I made this a little different for an inline view example:

SELECT x.*,
       z.mc
  FROM TABLE_X x
  JOIN (SELECT y.col, --inline view within the brackets
               MAX(y.example_col) 'mc'
          FROM TABLE_Y y
      GROUP BY y.col) z ON z.col = x.col

The key is making sure the inline view resultset includes the column(s) needed to join to, along with the columns.

LEFT JOINs

You might've noticed I didn't have any LEFT JOIN examples - this would only be necessary if columns from the subquery use NULL testing (COALESCE on almost any db these days, Oracle's NVL or NVL2, MySQLs IFNULL, SQL Server's ISNULL, etc...):

SELECT x.*,
       COALESCE((SELECT MAX(y.example_col)
          FROM TABLE_Y y
         WHERE y.col = x.col), 0)
  FROM TABLE_X x

Converted:

   SELECT x.*,
          COALESCE(z.mc, 0)
     FROM TABLE_X x
LEFT JOIN (SELECT y.col,
                  MAX(y.example_col) 'mc'
             FROM TABLE_Y y
         GROUP BY y.col) z ON z.col = x.col

Conclusion

I'm not sure if that will satisfy your typographic needs, but hope I've demonstrated that the key is determining what the JOIN criteria is. Once you know the column(s) involved, you know the table(s) involved.

OMG Ponies
One difficulty with your first example, the `IN clause`... The version using a subquery might return fewer rows than the join version, if there are duplicate values in `y.col`... Using the `IN` syntax ignores the duplication, using the `JOIN` syntax duplicates the relevant rows.
Stobor
Can't you just use Distinct to remove dups?
Mike Brown
@Mike: `DISTINCT` or `GROUP BY` - either would work.
OMG Ponies
I would also add that dependent subqueries (that reference fields int eh outer query) transform into OUTER/CROSS APPLY
Remus Rusanu
+1  A: 

This rates a strong "it depends".

At one level, if you're talking about queries compatible with ANSI SQL 89 or 92*, then I would guess it's a definite maybe. If you have simple (or even not so simple) queries consisting of "basic" select, from, and where clauses, then yes, I would like to think that it is mathematically possible to define processes and procedures to create and "uncreate" subqueries (though how you might determine when to algorithmically form a subquery is beyond me). I think this "rationale" could be applied to outer joins and correlated subqueries.

At another level, I'd say "no way". Most of the time I write a subquery, it's because I can't think of a way to wedge it into the "main" query. Very rarely this involves correlated subqueries, but more often than not in involves what are, I'm pretty darn sure, proprietary extensions to the standards. How could you account for pivots, unpivots, ranking functions, TOP N clauses (which may well be ANSI standards, I'll admit to never having read them cover to cover), FULL or OUTER APPLY, and the like? And that's just parts of SQL Server, I'm sure Oracle, DB2, MYSQL, and most every other player has their own extensions that break the "purist" relational model.

Of course, they say it is impossible to prove a negative. I'd summarize with "can't be done until proven otherwise", leave the proof to the academics and theoreticians, and point out that even then, whatever system you purchase won't support it unless it makes financial sense for the manufacturer to work it in. (Does any system support OUTER UNION yet?)

** A bit of googling failed to produce any references to a third ANSI SQL standard. I know I heard talk about it years ago, did it ever happen?*

Philip Kelley
SQL-86, SQL-89, SQL-92, SQL:1999, SQL:2003, SQL:2006 and SQL:2008... http://en.wikipedia.org/wiki/SQL#Standardization
Stobor
Geez. Time was, getting your product to be SQL 89 and SQL 92 compatibile were all the rage. I don't think I even heard rumors of anything after 99...
Philip Kelley
Heheh... Yeah, I went to pull up a link to SQL-99, and stumbled across the rest of them... Although when I saw the list, I did recall in distant memory having seen something somewhere about SQL:2003...
Stobor
+1  A: 

A fully automatic system for transforming queries from sub-queries into joins would be relatively difficulty to build. You would need to take an input query, parse it into a parse tree and then perform some fairly complex pattern matches on the parse tree - replacing sections of the tree with new sections of the parse tree. At the end you do a traversal of the tree to output the new query.

There can be some amazingly good or bad performance repercussions. Sometimes a sub-query is much faster than a join. Sometimes it is the inverse.

Philip Schlump
Thanks for bringing up a parse tree, that's a good lead. I'm sure there are performance differences in different but functionally equivalent queries. What I'm mostly curious about is how the transformation is performed.
fatcat1111
Take FLex and Bison and build a LALR(1) parser for the SQL. For MySql the grammar is easily available. For Oracle not so easy. You will need to have access to the data model and it helps to have statistical info on the tables (# of rows, histogram on data). Now parse the SQL. Perform a match on a big chunk of the tree. Replace that chunk. Some experimentation with # of rows and the distribution will quickly tell you when it is good to replace with a join. In general if the number of rows is uniform then a join is good. If One table is large - then bad.
Philip Schlump
If you want more details than this please add info to comments and I can get into the details and write code samples for you. I have built systems that do this.
Philip Schlump
BNF grammars for SQL: http://savage.net.au/SQL/
Stobor
The source code for MySQL has a YACC grammar and lexer for SQL in it.It is open source. BISON is the Open Source version of YACC.
Philip Schlump
Also take a look at the source for SQLite - it is in the public domain so you can do whatever you want with it. They use LEMON a LALR parser generator that has some advantages over using BISON.
Philip Schlump
If you just want the raw BNF then, http://savage.net.au/SQL/sql-92.bnf.html
Philip Schlump
If you want to have direct ability to carry out transformations dirctly on the code, you take a look at the DMS Software Reengineering Toolkit (http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html). DMS provides the ability to parse, build parse trees, match parse trees against code patterns, and carry out the substitutions directly.
Ira Baxter
+2  A: 

It often is possible, and what's good is that the query optimizer can do it automatically, so you don't have to care about it.

erikkallen
I've been relying on the optimizer, but have been curious about how it does it.
fatcat1111
+5  A: 

This question relies on a basic knowledge of Relational Algebra. You need to ask yourself what kind of join is being performed. For example, an LEFT ANTI SEMI JOIN is like a WHERE NOT EXISTS clause.

Some joins do not allow duplicating of data, some do not allow eliminating data. Others allow extra fields to be available. I discuss this in my blog at http://msmvps.com/blogs/robfarley/archive/2008/11/09/join-simplification-in-sql-server.aspx

Also, please don't feel you need to do everything in JOINs. The Query Optimizer should take care of all of this for you, and you can often make your queries much harder to maintain this way. You may find yourself using an extensive GROUP BY clause, and having interesting WHERE .. IS NULL filters, which will only serve to disconnect the business logic from the query design.

A subquery in the SELECT clause (essentially a lookup) only provides an extra field, not duplication or elimination. Therefore, you would need to make sure that you enforce GROUP BY or DISTINCT values in your JOIN, and use an OUTER JOIN to guarantee that behaviour is the same.

A subquery in the WHERE clause can never duplicate data, or provide extra columns to the SELECT clause, so you should use GROUP BY / DISTINCT to check this. WHERE EXISTS is similar. (This the LEFT SEMI JOIN)

WHERE NOT EXISTS (LEFT ANTI SEMI JOIN) doesn't provide data, and doesn't duplicate rows, but can eliminate... for this you need to do LEFT JOINs and look for NULLs.

But the Query Optimizer should handle all this for you. I actually like having occasional subqueries in the SELECT clause, because it makes it very clear that I am not duplicating or eliminating rows. The QO can tidy it for me, but if I'm using a view or inline table-valued function, I want to make it clear to those who come after me that the QO can simplify it down a lot. Have a look at the Execution Plans of your original query, and you'll see that the system is providing the INNER/OUTER/SEMI joins for you.

The thing you really need to be avoiding (at least in SQL Server) are functions that use BEGIN and END (such as Scalar Functions). They may feel like they simplify your code, but they will actually be executed in a separate context, as the system sees them as procedural (not simplifiable).

I did a session on this kind of thing at the recent SQLBits V conference. It was recorded, so you should be able to watch it at some point (if you can put up with my jokes!)

Rob Farley
+1 very nice sum up
Remus Rusanu