ansaurus

Question

T-SQL query on huge table running slow depending on join conditions

Answer 1

+2 A:

The best way to figure out why there are such differences in performance is to examine the execution plan for the queries you tried. They can provide a lot of information. Unfortunately, I am not an SQL Server expert, so I can't tell how the execution plans can be obtained.

Cătălin Pitiș 2009-10-14 16:20:38

I already looked at the execution plan, but it does not give me clues. I am not very used to analyze the execution plans.

MaxiWheat 2009-10-14 17:11:57

That is unfortunate. Because when working with large amount of data, the execution plan can give a lot of hints in terms of optimization of the queries (additional indexes to be added, query hints, etc.)

Cătălin Pitiș 2009-10-15 06:22:03

Answer 2

+2 A:

I have no idea if this will be any help in terms of performance (since I don't have a 17mio. row table at hand to test), but how about this:

select the minimum amount of information you absolutely need (don't do SELECT C1.* !!)
group by telephone number and count occurences
any "telephone number group" with 2 or more occurences needs to be inspected more closely

Since you're on SQL Server 2008, you should be able to create something like this (a Common Table Expression - CTE). This should reduce the scope of search, since only (hopefully!) few entries in the Company table will be indeed duplicated - so this should limit your search and thus speed it up (or at least that's the hope!).

WITH PhoneDuplicates AS
(SELECT c.Telephone, COUNT(*) as PhoneCount
   FROM dbo.Company AS c 
   GROUP BY c.Telephone
   HAVING COUNT(*) > 1
)
SELECT 
  (list of fields from company table)
FROM
  dbo.Company AS c
INNER JOIN
  PhoneDuplicates as PD ON PD.Telephone = c.Telephone

Marc

marc_s 2009-10-14 16:27:28

Answer 3

A:

Something you could try is

SELECT C1.*
FROM (select * from dbo.Company where iId_third_party_id IS NULL) AS C1 WITH(NOLOCK)
INNER JOIN (select * from dbo.Company where iId_third_party_id IS NULL) AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company

As this has helped us before.

Antony Koch 2009-10-14 16:29:38

I don't think this would work because your sub-queries create a "AND", making only results that both have NULL to be returned when only one of them needs to be NULL for the join to succeed.

MaxiWheat 2009-10-14 17:10:45

Answer 4

A:

I doubt both your non-clustered indexes are going to be used (sTelephone, iId_third_party_id). Are you clustering on the primary key?

Look at the estimated execution plan.

Off the top of my head without seeing the plan, I would think about adding the iId_third_party_id to the nonclustered index on sTelephone and if you aren't clustering on the primary key, then add the iId_company to the index as well.

Note that there is also a potential for cross-join multiplication of results when there are more than two duplicates for a given phone number.

Cade Roux 2009-10-14 16:32:35

Answer 5

A:

With Temp as
(Select *
FROM dbo.Company as c
Where c.iId_third_party_id is NULL)

Select C1.*
From temp as C1 With (NoLock)
INNER JOIN Temp AS C2 
ON C2.sTelephone = C1.sTelephone AND C1.iId_company != C2.iId_company

Something like this might work

RisingCascade 2009-10-14 16:33:38

Answer 6

+1 A:

Performance-wise, what is the cardinality of your filtering columns?

Maybe having only

C1.iId_third_party_id IS NULL

gave a boost to performance because SQL could tell (from the statistics built on the index) that relatively few rows met the necessary criteria. When you added

(... OR C2.iId_third_party_id IS NULL)

perhaps SQL figured the resulting join would produce so many matching rows that it would not be efficient to use the index on that column.

Similarly, how many matching/duplicate phone numbers will there be? If this situation is very infrequent, I'd do something like marc_s's query (he beat me to it), as that would fly.

A lot depends on what the data looks like--how frequently or infrequently your filtering criteria appear. Analyze that, try and understand how it is and how it might change over time, and design your queries accordingly.

Philip Kelley 2009-10-14 16:44:45

When the query actually ends, it returns between 10000 and 20000 rows

MaxiWheat 2009-10-14 17:01:58

Answer 7

A:

the speed increase you see when removing the OR part is because an OR automaticaly does an index scan instead of a seek. by unioning them together you do 2 seeks which is faster.

try finding the dupes using the the row_number technique:

;with cteDupes(RN, DupeID, DupeTelephone) as
(
SELECT  row_number() over(partition by sTelephone order by iId_company, sTelephone) RN,
        iId_company, sTelephone
FROM    dbo.Company 
WHERE   iId_third_party_id IS NULL
)
select * from cteDupes
where RN > 1

this will return only duped rows. the bonus of this is that you only get one table pass instead of two.

Mladen Prajdic 2009-10-14 16:46:49

ansaurus

tags:

views:

answers:

T-SQL query on huge table running slow depending on join conditions

related questions