tags:

views:

947

answers:

4

I wonder if anyone can help improve my understanding of JOINs in SQL. [If it is significant to the problem, I am thinking MS SQL Server specifically.]

Take 3 tables A, B [A related to be by some A.AId], and C [B related to C by some B.BId]

If I compose a query e.g

SELECT *
FROM A JOIN B 
ON A.AId = B.AId

All good - I'm sweet with how this works.

What happens when Table C (Or some other D,E, .... gets added)

In the situation

SELECT *
FROM A JOIN B 
  ON A.AId = B.AId
JOIN C ON C.BId = B.BId

What is C joining to? - is it that B table (and the values in the B table?) Or is it some other temporary result set that is the result of the A+B Join that the C table is joined to?

[The implication being not all values that are in the B table will necessarily be in the temporary result set A+B based on the join condition for A,B]

A specific (and fairly contrived) example of why I am asking is because I am trying to understand behaviour I am seeing in the following:

Tables 
Account (AccountId, AccountBalanceDate, OpeningBalanceId, ClosingBalanceId)
Balance (BalanceId)
BalanceToken (BalanceId, TokenAmount)

Where:
Account->Opening, and Closing Balances are NULLABLE 
(may have opening balance, closing balance, or none)

Balance->BalanceToken is 1:m - a balance could consist of many tokens

Conceptually, Closing Balance of a date, would be tomorrows opening balance

If I was trying to find a list of all the opening and closing balances for an account

I might do something like

SELECT AccountId
, AccountBalanceDate
, Sum (openingBalanceAmounts.TokenAmount) AS OpeningBalance
, Sum (closingBalanceAmounts.TokenAmount) AS ClosingBalance
FROM Account A 
   LEFT JOIN BALANCE OpeningBal 
      ON A.OpeningBalanceId = OpeningBal.BalanceId
   LEFT JOIN BALANCE ClosingBal 
      ON A.ClosingBalanceId = ClosingBal.BalanceId
   LEFT JOIN BalanceToken openingBalanceAmounts 
      ON openingBalanceAmounts.BalanceId = OpeningBal.BalanceId
   LEFT JOIN BalanceToken closingBalanceAmounts 
      ON closingBalanceAmounts.BalanceId = ClosingBal.BalanceId
   GROUP BY AccountId, AccountBalanceDate

Things work as I would expect until the last JOIN brings in the closing balance tokens - where I end up with duplicates in the result.

[I can fix with a DISTINCT - but I am trying to understand why what is happening is happening]

I have been told the problem is because the relationship between Balance, and BalanceToken is 1:M - and that when I bring in the last JOIN I am getting duplicates because the 3rd JOIN has already brought in BalanceIds multiple times into the (I assume) temporary result set.

I know that the example tables do not conform to good DB design

Apologies for the essay, thanks for any elightenment :)

Edit in response to question by Marc

Conceptually for an account there should not be duplicates in BalanceToken for An Account (per AccountingDate) - I think the problem comes about because 1 Account / AccountingDates closing balance is that Accounts opening balance for the next day - so when self joining to Balance, BalanceToken multiple times to get opening and closing balances I think Balances (BalanceId's) are being brought into the 'result mix' multiple times. If it helps to clarify the second example, think of it as a daily reconciliation - hence left joins - an opening (and/or) closing balance may not have been calculated for a given account / accountingdate combination.

+2  A: 

We know that the data from B is going to be filtered by the (inner) join to A (the data in A is also filtered). So if we (inner) join from B to C, thus the set C is also filtered by the relationship to A. And note also that any duplicates from the join will be included.

However; what order this happens in is up to the optimizer; it could decide to do the B/C join first then introduce A, or any other sequence (probably based on the estimated number of rows from each join and the appropriate indexes).


HOWEVER; in your later example you use a LEFT OUTER join; so Account is not filtered at all, and may well my duplicated if any of the other tables have multiple matches.

Are there duplicates (per account) in BalanceToken?

Marc Gravell
Hi Marc - thanks for your response, I have edited the initial question with information in response to your question
Delaney
Are you sure on that part, where say the optimizer would decide which join he does first? I think the second join uses the results from the first and so, the first join has to be execute before second, right?
Tim Büthe
The final result needs to respect the semantics of the query - but the optimizer can start pretty much anywhere (the results must be the same, though). For example, if we are inner-joining a table with 1M rows to a table with 10 rows, it probably makes sense to star with the 10. Obviously this is different for any kind of outer join - but the optimizer still has lots of options **as long as the result is the same**.
Marc Gravell
+1  A: 

I often find it helps to view the actual execution plan. In query analyser/management studio, you can turn this on for queries from the Query menu, or use Ctrl+M. After running the query, the plan that was executed is shown in another result tab. From this you'll see that C and B are joined first, and then the result is joined with A. The plan might vary depending on information the DBMS has because both joins are inner, making it A-and-B-and-C. What I mean is that the result will be the same regardless of which is joined first, but the time it takes might differ greatly, and this is where the optimiser and hints come into play.

Bernhard Hofmann
+3  A: 

Conceptually here is what happens when you join three tables together.

  1. The optimizer comes up with a plan, which includes a join order. It could be A, B, C, or C, B, A or any of the combinations
  2. The query execution engine applies any predicates (WHERE clause) to the first table that do not invovle any of the other tables. It selects out the columns mentioned in the join conditions or the select list or the order by list. Call this result A
  3. It joins this result set to the second table. For each row it joins to the second table, applying any predicates that may apply to the second table. This results in another temporary resultset.
  4. Then it joins in the final table and applies the ORDER BY

This is conceptually what happens. Infact there are many possible optimizations along the way. The advantage of the relational model is that the sound mathematical basis makes various transformations of plan possible while not changing the correctness.

For example, there is really no need to generate the full result sets along the way. The ORDER BY may instead by done via accessing the data using an index in the first place. There are lots of types of joins that can be done as well.

WW
Thanks WW - I think that clears up what I was wondering - subsequent joins (table 3 onwards) happen against / to the 'last' interim result set that is being built along the way - not the original tables.So saying from the first example, ... B JOIN C ON B.BId = C.BId - joins agains rows that are in the A+B intermediate resultset NOT against the rows that are in the B table.[conceptually]
Delaney
Yes, only two result sets are ever joined together at once and it builds up. This is why the join order is very important to the plan.
WW
+1  A: 

Joins can be tricky, and much of the behavior is of course dictated by how the data is stored in the actual tables.

Without seeing the tables it's hard to give a clear answer in your particular case but I think the basic issue is that you are summing over multiple result sets that are being combined into one.

Perhaps instead of multiple joins you should make two separate temporary tables in your query, one with the accountID, date and sum of openingbalances, a second one with the accountID, date and sum of closing balances, then joining those two on AccountID and date.

In order to find out exactly what is happening with joins, also in your specific case, I would do the following:

Change the initial part

SELECT accountID Accountbalancedate, sum(...) as openingbalance, sum(...) as closingbalance FROM

to simply

"SELECT * FROM"

Study the resulting table, and you will see exactly what data is being duplicated. Remove the joins one by one and see what happens. This should give you a clue to what it is about your particular data that is causing the dupes.

If you open the query in SQL server management studio (Free version exists) you can edit the query in the designer. The visual view of how the tables are being joined might also help you realize what's going on.

Console