ansaurus

Question

Speed up sql JOIN

Answer 1

+1 A:

Before you even start changing your query, make sure that every table has a clustered index that makes sense for both this query and all other vital queries. Having clustered indexes on your tables is vital in SQL Server to ensure proper performance.
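A rough sketch of how that check might look for the tables in the question. The index name is made up, and the CREATE statement only applies if a table turns out to have no clustered index at all; clustering on EntryDate is an assumption based on the date-range filter in the query, not something the question confirms.

```sql
-- List the existing indexes (and whether one is clustered) for each table.
EXEC sp_helpindex 'StaffEntry';
EXEC sp_helpindex 'MainFrame';
EXEC sp_helpindex 'Rejected';

-- Hypothetical: only if [StaffEntry] has no clustered index yet.
-- EntryDate is chosen here because the query filters on a date range.
CREATE CLUSTERED INDEX IX_StaffEntry_EntryDate
    ON [StaffEntry] (EntryDate);
```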

kasperjj 2008-11-07 16:49:11

Answer 2

+1 A:

This doesn't make sense:

SELECT *
FROM [StaffEntry] s
LEFT JOIN [MainFrame] m ON m.ItemNumber = s.ItemNumber 
    AND m.Customer=s.Customer 
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev  -- PO revision number
LEFT JOIN [Rejected] r ON r.OrderID = s.OrderID
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
    AND r.OrderID IS NULL AND s.OrderID IS NULL

If s.OrderID IS NULL, then r.OrderID = s.OrderID can never be true (a comparison with NULL never evaluates to true), so no rows from [Rejected] will ever be joined. As given, the query is equivalent to:

SELECT *
FROM [StaffEntry] s
LEFT JOIN [MainFrame] m ON m.ItemNumber = s.ItemNumber 
    AND m.Customer=s.Customer 
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev  -- PO revision number
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
    AND s.OrderID IS NULL

Are you sure that code you posted is right?

Cade Roux 2008-11-07 16:49:54

Ah. Looks like you beat me to it by 27 seconds. lol

Kevin Fairchild 2008-11-07 16:51:12

D'Oh! You're right. I fixed it.

Joel Coehoorn 2008-11-07 16:54:53

Answer 3

+5 A:

First off, you can get rid of the second LEFT JOIN.

Your WHERE was filtering out any matches anyhow. For instance, if s.OrderID was 1 and there was an r.OrderID with a value of 1, the IS NULL check in the WHERE wouldn't let that row through. So it'll only return records where s.OrderID IS NULL, if I'm reading it correctly.

Secondly, if you're dealing with a large amount of data, adding a NOLOCK table hint typically won't hurt, assuming you don't mind the possibility of a dirty read here or there :-P Usually worth the risk, though.

SELECT *
FROM [StaffEntry] s (nolock)
LEFT JOIN [MainFrame] m (nolock) ON m.ItemNumber = s.ItemNumber 
    AND m.Customer=s.Customer 
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev  -- PO revision number
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
    AND s.OrderID IS NULL

Lastly, there was a part of your question which wasn't too clear to me:

"since I'm looking for records in the MainFrame table that don't exist, after doing the JOIN we have that ugly IS NULL in the where clause."

Ok... But are you trying to limit it to just where those MainFrame table records don't exist? If so, you'll want that expressed in the WHERE as well, right? So something like this...

SELECT *
FROM [StaffEntry] s (nolock)
LEFT JOIN [MainFrame] m (nolock) ON m.ItemNumber = s.ItemNumber 
    AND m.Customer=s.Customer 
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev  -- PO revision number
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
    AND s.OrderID IS NULL AND m.ItemNumber IS NULL

If that's what you were intending with the original statement, perhaps you can get rid of the s.OrderID IS NULL check?
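If the goal really is "StaffEntry rows with no matching MainFrame row", the same anti-join can also be written with NOT EXISTS. This is only a sketch using the join columns from the query above; it drops the s.OrderID IS NULL check as suggested, and whether it runs any faster depends on the optimizer and the available indexes. Note that it returns only the StaffEntry columns, whereas the LEFT JOIN version also returns the (all NULL) MainFrame columns.

```sql
SELECT s.*
FROM [StaffEntry] s
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
    AND NOT EXISTS (
        SELECT 1
        FROM [MainFrame] m
        WHERE m.ItemNumber = s.ItemNumber
            AND m.Customer = s.Customer
            AND m.CustomerPO = s.CustomerPO -- purchase order
            AND m.CustPORev = s.CustPORev   -- PO revision number
    )
```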

Kevin Fairchild 2008-11-07 16:50:21

+1 for nolock hint on SQL Server 2000

Corbin March 2008-11-07 16:53:50

The original code does have NOLOCK hints, and there was a mistake in what I had posted at first that is now fixed.

Joel Coehoorn 2008-11-07 16:55:32

Answer 4

+1 A:

In addition to what Kasperjj has suggested (which I do agree should come first), you might consider using temp tables to restrict the amount of data. Now, I know, I know that everyone says to stay away from temp tables. And I usually do, but sometimes it is worth a try because you can drastically shrink the amount of data to join, which makes the overall query faster. (Of course, this depends on how much you can shrink the result sets.)

My final thought is that sometimes you will just need to experiment with different ways of putting the query together. There might be too many variables for anyone here to give an answer. On the other hand, people here are smart, so I could be wrong.

Best of luck!

Regards, Frank

PS: I forgot to mention that if you wanted to try this temp table method, you'd also need to experiment with different indexes and primary keys on the temp tables. Depending on the amount of data, indexes and PKs can help.
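A minimal sketch of the temp table idea, assuming the date filter on StaffEntry is what shrinks the data most. The temp table name, the index name and the index column order are all made up for illustration, and @StartDate and @EndDate are assumed to be declared as in the original query.

```sql
-- Step 1: copy only the StaffEntry rows in the date range into a temp table.
SELECT s.*
INTO #StaffEntryRange
FROM [StaffEntry] s
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate;

-- Step 2 (optional, worth benchmarking with and without):
-- index the temp table on the columns used in the join.
CREATE INDEX IX_StaffEntryRange_Join
    ON #StaffEntryRange (Customer, CustomerPO, CustPORev, ItemNumber);

-- Step 3: join the much smaller temp table instead of the full table.
SELECT *
FROM #StaffEntryRange s
LEFT JOIN [MainFrame] m ON m.ItemNumber = s.ItemNumber
    AND m.Customer = s.Customer
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev;  -- PO revision number

DROP TABLE #StaffEntryRange;
```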

Frank V 2008-11-07 16:56:04

Answer 5

+1 A:

Indexing on all the tables is going to be important. If you can't do much with the indexing on the [MainFrame] columns used in the join, you can also pre-limit the rows to be searched in [MainFrame] (and [Rejected], although that already looks like it has a PK) by specifying a date range, if the date windows should be roughly similar. This can cut down the number of rows on the right-hand side of that join.

I would also look at the execution plan and do a simple black-box evaluation of which of your JOINs is really the most expensive, m or r, by benchmarking the query with only one or the other. I would suspect it is m because of the multiple columns and the lack of useful indexes.
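One way to run that comparison, sketched with SQL Server's statistics output switched on. The two variants below are just the query from the question with one join removed each time; @StartDate and @EndDate are assumed to be declared as in the original.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant A: only the [MainFrame] join.
SELECT *
FROM [StaffEntry] s
LEFT JOIN [MainFrame] m ON m.ItemNumber = s.ItemNumber
    AND m.Customer = s.Customer
    AND m.CustomerPO = s.CustomerPO
    AND m.CustPORev = s.CustPORev
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate;

-- Variant B: only the [Rejected] join.
SELECT *
FROM [StaffEntry] s
LEFT JOIN [Rejected] r ON r.OrderID = s.OrderID
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing the logical reads and elapsed times reported for each variant in the Messages output should show which join dominates.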

You could restrict m.EntryDate to within a few days or months of your range. But if you already have indexes on [MainFrame], the question is why they aren't being used or, if they are being used, why the performance is so slow.
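A sketch of that pre-limiting, with the extra condition placed in the ON clause so the LEFT JOIN still keeps unmatched StaffEntry rows. The 30-day padding is an arbitrary assumption and would need tuning to how loosely the two dates correlate; any MainFrame row outside the padded window is treated as "no match", so too narrow a window changes the results.

```sql
SELECT *
FROM [StaffEntry] s
LEFT JOIN [MainFrame] m ON m.ItemNumber = s.ItemNumber
    AND m.Customer = s.Customer
    AND m.CustomerPO = s.CustomerPO -- purchase order
    AND m.CustPORev = s.CustPORev   -- PO revision number
    -- Assumed padding of 30 days on each side of the requested range.
    AND m.EntryDate BETWEEN DATEADD(day, -30, @StartDate)
                        AND DATEADD(day, 30, @EndDate)
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
```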

Cade Roux 2008-11-07 17:01:57

M is indexed by customer/PO/rev. Upvoted because limiting the rows in MainFrame is a good idea. I'm just not sure how to implement it, as the dates are indeed only "roughly" correlated, and not at all exact.

Joel Coehoorn 2008-11-07 17:07:13

Answer 6

A:

Update:
In case it wasn't already obvious, I made a mistake in the code for the original question. That's now fixed, but unfortunately it means some of the better responses here are actually going in completely the wrong direction.

I also have some statistics updates: I can make the query run nice and quick by severely limiting the date range used with StaffEntry.EntryDate. Unfortunately, I'm only able to do that because, after running it the long way once, I then know exactly which dates I care about. I don't normally know that in advance.

The execution plan from the original run showed 78% cost for a clustered index scan on the StaffEntry table, 11% cost for an index seek on the MainFrame table, and 0% cost for the join itself. With the narrow date range, that changes to 1% for an index seek on StaffEntry, 1% for an index seek on MainFrame, and 93% for a table scan of Rejected. These are actual plans, not estimated.
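A table scan of Rejected at 93% suggests the join on OrderID is not finding a usable index. One thing that could be tried is sketched below; the index name is made up, and it only makes sense if no existing index on Rejected already leads with OrderID.

```sql
-- Hypothetical: give the join on r.OrderID something to seek on.
CREATE INDEX IX_Rejected_OrderID
    ON [Rejected] (OrderID);
```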

Joel Coehoorn 2008-11-07 17:31:10

If the original code was flawed, you might consider closing this one and, if the issue still needs to be addressed, opening a new one using the correct code :) And include any relevant info it seems like people might need in order to help you out. Good luck!

Kevin Fairchild 2008-11-07 20:01:13

Answer 7

+1 A:

Try changing LEFT JOIN [Rejected] r with (nolock) ON r.OrderID = s.OrderID into a RIGHT MERGE JOIN:

SELECT ...
FROM [Rejected] r
     RIGHT MERGE JOIN [StaffEntry] s with (nolock) ON r.OrderID = s.OrderID
     LEFT JOIN [MainFrame] m with (nolock) ON....

Mladen Prajdic 2008-11-07 23:20:23

That makes some sense: start with the smaller table and add the one with the matching index first. Then join in the "messy" table later. It keeps the result set smaller, longer, which is something I try to preach around here.

Joel Coehoorn 2008-11-10 16:57:45
