ansaurus

Question

Are these cross joins causing me problems?

Answer 1

+1 A:

The query you described is no way a CROSS JOIN.

SQL Server is smart enough to transform the WHERE condition into the JOIN's.

However, I see no point in GROUP BY here.

This query:

  SELECT c.id,
         c.date,
         c.time,
         c.duration,
         c.location
    FROM #tmp q, call_table c (NOLOCK)
   WHERE q.id=c.id  
     AND q.date=c.date
     AND q.time=c.time
     AND q.duration=c.duration
     AND q.location=c.location
GROUP BY c.id,
         c.date,
         c.time,
         c.duration,
         c.location

can be easilty rewritten as

  SELECT  c.id,
          c.date,
          c.time,
          c.duration,
          c.location
  FROM    call_table c (NOLOCK)
  WHERE   EXISTS
          (
          SELECT  NULL
          FROM    #tmp q
          WHERE   q.id = c.id  
                  AND q.date = c.date
                  AND q.time = c.time
                  AND q.duration = c.duration
                  AND q.location = c.location
          )

, provided that c.id is a PRIMARY KEY.

If it's not, just add DISTINCT to SELECT above.

Update:

From your plan I see that that your query uses HASH JOIN, while my uses MERGE SEMI JOIN.

The latter one is usually more efficient if you have an ordered set, but for some reason the query does not use the composite index you created, but instead performs full table scan.

This is strange, since all your values are contained within the index.

Probably (probably) this is because your fields allow NULL's.

Make sure that you use only the fields from the composite index both in WHERE conditions and in SELECT clause and, if possible, make them NOT NULL.

This should make your query use preordered resultsets in MERGE SEMI JOIN. You can tell it if you see neither TABLE SCAN nor SORT in the plan, just two INDEX SCAN's.

And two more questions:

Is c.id a PRIMARY KEY on call_table?
Is q.id a PRIMARY KEY on #tmp?

If answer to both questions is yes, then you will benefit from doing two things:

Defining your PRIMARY KEY as CLUSTERED on both tables

Rewriting your query as this:

 SELECT  c.id,
         c.date,
         c.time,
         c.duration,
         c.location
 FROM    call_table c (NOLOCK)
 JOIN    #tmp q
 ON      q.id = c.id  
         AND q.date = c.date
         AND q.time = c.time
         AND q.duration = c.duration
         AND q.location = c.location

Quassnoi 2009-07-01 16:46:10

And, using exists is considered better practice, correct?

scottm 2009-07-01 16:58:48

@scotty2012: yes, since it will use some kind of a SEMI JOIN and handle the duplicates for you.

Quassnoi 2009-07-01 17:05:40

I'm comparing the two versions now. The first version takes about 10 secs, and does a full table scan on both the #tmp and call_table. Your version takes about 16 seconds. Any idea why?

scottm 2009-07-01 17:13:06

Could you please post the execution plan for both queries?

Quassnoi 2009-07-01 17:26:29

@Quassnoi, done. I've had to change some of the names and stuff, but I think you'll get what your looking for.

scottm 2009-07-01 19:05:57

The columns do allow nulls. And, the #tmp table does not have an index, I can't change either setting, I'm just trying to prove it's causing a problem. If nothing else I can use this stuff to point out some of the problems.

scottm 2009-07-01 22:51:16

@scotty2012: it's very strange, since I clearly see a PRIMARY KEY index on [db].[dbo].[queue], whatever it is, in your plan. Anyway, try rewriting the query as a plain JOIN and see if it helps.

Quassnoi 2009-07-01 23:03:08

Answer 2

A:

How is this query going to work? Does it mash the two tables together for a total of 1,000,000,000 in memory, then use the group clause to trim it back down? Could you explain what steps SQL takes to compile the results?

It might go something like this. Say Sql Server decides to use a hash join. It creates an in-memory hash table of #temp, with a hash based on id, date, time, duration, and location. Then it iterates over the rows in call_table. For each row, it uses the hash table to detect if a matching row exists. If it does, the row gets added to the result table. So no 1,000,000,000 rows are ever in memory.

Another option (maybe even better here) is to iterate over one table, and use the id column to do an index lookup on the other table. This requires even less memory (although it would be very beneficial if the index were in cache.)

You can see what Sql Server really does by reading the execution plan. You can enable the execution plan under the Query menu.

Andomar 2009-07-01 17:11:15

OK, makes sense. I use the execution plan viewer, but I still don't quite grok the different join types, or how they might affect memory, etc.

scottm 2009-07-01 17:15:23

Andomar 2009-07-01 17:18:03

ansaurus

tags:

views:

answers:

Are these cross joins causing me problems?

Execution Plans:

related questions