views:

101

answers:

2

I have a poor man's replication setup that I can't do anything about. Some identifying data (basically primary key) from a call_table is copied into another table via a simple trigger, and then the "replication server" runs a stored procedure to copy the data from the queue table to a #temp table (to prevent locking in SQL 6.5 is the case that was made to me). Finally, a query uses the key data from the temp table to pull data back to the replication server from the call_table using this query:

  /* select the data to return to poor man replication server */     
  SELECT c.id,
         c.date,
      c.time,
      c.duration,
      c.location
    FROM #tmp q, call_table c (NOLOCK)
   WHERE q.id=c.id  
     AND q.date=c.date
     AND q.time=c.time
     AND q.duration=c.duration
     AND q.location=c.location
GROUP BY c.id,
         c.date,
         c.time,
         c.duration,
         c.location

Once a night the queue table is purged and this starts over. While investigating this, the implicit cross join jumped at me (I'm on the side that they are usually evil), but then I read The power of the Cross Join. I'm here because I'm not quite convinced. Say the temp table has about 10,000 rows for the day, the call_table has about 100,000 for the month so far. How is this query going to work? Does it mash the two tables together for a total of 1,000,000,000 in memory, then use the group clause to trim it back down? Could you explain what steps SQL takes to compile the results?

Execution Plans:

My Query:
      |--Hash Match Root(Aggregate, HASH:([c].[id], [c].[date], [c].[location], [c].[time], [c].[duration]), RESIDUAL:(((((((((((((((((((((([c].[id]=[c].[id] AND [c].[PIN]=[c].[PIN]) AND [c].[ORIG]=[c].[ORIG]) AND [c].[date]=[c].[date]) AND [c].[CTIME]=[c].[CTIME
           |--Hash Match Team(Inner Join, HASH:([q].[id], [q].[date], [q].[location], [q].[time], [q].[duration])=([c].[id], [c].[date], [c].[location], [c].[time], [c].[duration]), RESIDUAL:(((([c].[id]=[q].[id] AND [c].[location]=[q].[location]) AND [c].[duration]=[q].[duration]) AND [
                |--Table Scan(OBJECT:([db].[dbo].[queue] AS [q]))
                |--Table Scan(OBJECT:([db].[dbo].[call_table] AS [c]))

Yours:        
  |--Merge Join(Right Semi Join, MERGE:([q].[id], [q].[date], [q].[time], [q].[duration], [q].[location])=([c].[id], [c].[date], [c].[time], [c].[duration], [c].[location]), RESIDUAL:(((([q].[id]=[c].[id] AND [q].[location]=[c].[location]) AND [q].[duration]=[c].[duration]) AND [q].[
       |--Index Scan(OBJECT:([db].[dbo].[queue].[PK_queue] AS [q]), ORDERED)
       |--Sort(ORDER BY:([c].[id] ASC, [c].[date] ASC, [c].[time] ASC, [c].[duration] ASC, [c].[location] ASC))
            |--Table Scan(OBJECT:([db].[dbo].[call_table] AS [c]))
+1  A: 

The query you described is no way a CROSS JOIN.

SQL Server is smart enough to transform the WHERE condition into the JOIN's.

However, I see no point in GROUP BY here.

This query:

  SELECT c.id,
         c.date,
         c.time,
         c.duration,
         c.location
    FROM #tmp q, call_table c (NOLOCK)
   WHERE q.id=c.id  
     AND q.date=c.date
     AND q.time=c.time
     AND q.duration=c.duration
     AND q.location=c.location
GROUP BY c.id,
         c.date,
         c.time,
         c.duration,
         c.location

can be easilty rewritten as

  SELECT  c.id,
          c.date,
          c.time,
          c.duration,
          c.location
  FROM    call_table c (NOLOCK)
  WHERE   EXISTS
          (
          SELECT  NULL
          FROM    #tmp q
          WHERE   q.id = c.id  
                  AND q.date = c.date
                  AND q.time = c.time
                  AND q.duration = c.duration
                  AND q.location = c.location
          )

, provided that c.id is a PRIMARY KEY.

If it's not, just add DISTINCT to SELECT above.

Update:

From your plan I see that that your query uses HASH JOIN, while my uses MERGE SEMI JOIN.

The latter one is usually more efficient if you have an ordered set, but for some reason the query does not use the composite index you created, but instead performs full table scan.

This is strange, since all your values are contained within the index.

Probably (probably) this is because your fields allow NULL's.

Make sure that you use only the fields from the composite index both in WHERE conditions and in SELECT clause and, if possible, make them NOT NULL.

This should make your query use preordered resultsets in MERGE SEMI JOIN. You can tell it if you see neither TABLE SCAN nor SORT in the plan, just two INDEX SCAN's.

And two more questions:

  • Is c.id a PRIMARY KEY on call_table?
  • Is q.id a PRIMARY KEY on #tmp?

If answer to both questions is yes, then you will benefit from doing two things:

  • Defining your PRIMARY KEY as CLUSTERED on both tables
  • Rewriting your query as this:

     SELECT  c.id,
             c.date,
             c.time,
             c.duration,
             c.location
     FROM    call_table c (NOLOCK)
     JOIN    #tmp q
     ON      q.id = c.id  
             AND q.date = c.date
             AND q.time = c.time
             AND q.duration = c.duration
             AND q.location = c.location
    
Quassnoi
And, using exists is considered better practice, correct?
scottm
@scotty2012: yes, since it will use some kind of a SEMI JOIN and handle the duplicates for you.
Quassnoi
I'm comparing the two versions now. The first version takes about 10 secs, and does a full table scan on both the #tmp and call_table. Your version takes about 16 seconds. Any idea why?
scottm
Could you please post the execution plan for both queries?
Quassnoi
@Quassnoi, done. I've had to change some of the names and stuff, but I think you'll get what your looking for.
scottm
The columns do allow nulls. And, the #tmp table does not have an index, I can't change either setting, I'm just trying to prove it's causing a problem. If nothing else I can use this stuff to point out some of the problems.
scottm
@scotty2012: it's very strange, since I clearly see a PRIMARY KEY index on [db].[dbo].[queue], whatever it is, in your plan. Anyway, try rewriting the query as a plain JOIN and see if it helps.
Quassnoi
A: 

How is this query going to work? Does it mash the two tables together for a total of 1,000,000,000 in memory, then use the group clause to trim it back down? Could you explain what steps SQL takes to compile the results?

It might go something like this. Say Sql Server decides to use a hash join. It creates an in-memory hash table of #temp, with a hash based on id, date, time, duration, and location. Then it iterates over the rows in call_table. For each row, it uses the hash table to detect if a matching row exists. If it does, the row gets added to the result table. So no 1,000,000,000 rows are ever in memory.

Another option (maybe even better here) is to iterate over one table, and use the id column to do an index lookup on the other table. This requires even less memory (although it would be very beneficial if the index were in cache.)

You can see what Sql Server really does by reading the execution plan. You can enable the execution plan under the Query menu.

Andomar
OK, makes sense. I use the execution plan viewer, but I still don't quite grok the different join types, or how they might affect memory, etc.
scottm
Andomar