views:

445

answers:

7

Hi, we have a huge table of companies (17 million entries) for which we want to find duplicates according to search critera (based on phone number). The query runs very slow (5 minutes+)

Here is a simplified version of the query, but the problem is the same :

SELECT C1.*
FROM dbo.Company AS C1 WITH(NOLOCK)
INNER JOIN dbo.Company AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company 
                         AND (C1.iId_third_party_id IS NULL OR 
                              C2.iId_third_party_id IS NULL)

Columns explanation :

  • iId_company : Primary key, integer auto increment
  • sTelephone : Phone number of the company, varchar with non-clustered index on it
  • iId_third_party_id : ID from a third party provider, may be null when users insert themself new companies (for that we want to find duplicates), integer with a non-clustered index on it too.

What we what is the companies with the same phone number, but different primary keys (duplicates), and also that one part doesn't have a third party ID (which tells us that a end user inserted it.

Now, I tried some things but gave me no clue :

  • When removing one side in the OR clause, only remaining C1.iId_third_party_id IS NULL gives a HUGE boost, the query takes like less than 5 seconds
  • When removing completely the condition in parenthesis with the OR clause, the query then comes slow again (1 minute +), but I think it's only because the data set to return in very huge.

I ended up making UNION to combine both queries (each with its part on the OR condition) but I would like to understand why there is so much difference when using that OR in the condition.

+2  A: 

The best way to figure out why there are such differences in performance is to examine the execution plan for the queries you tried. They can provide a lot of information. Unfortunately, I am not an SQL Server expert, so I can't tell how the execution plans can be obtained.

Cătălin Pitiș
I already looked at the execution plan, but it does not give me clues. I am not very used to analyze the execution plans.
MaxiWheat
That is unfortunate. Because when working with large amount of data, the execution plan can give a lot of hints in terms of optimization of the queries (additional indexes to be added, query hints, etc.)
Cătălin Pitiș
+2  A: 

I have no idea if this will be any help in terms of performance (since I don't have a 17mio. row table at hand to test), but how about this:

  • select the minimum amount of information you absolutely need (don't do SELECT C1.* !!)
  • group by telephone number and count occurences
  • any "telephone number group" with 2 or more occurences needs to be inspected more closely

Since you're on SQL Server 2008, you should be able to create something like this (a Common Table Expression - CTE). This should reduce the scope of search, since only (hopefully!) few entries in the Company table will be indeed duplicated - so this should limit your search and thus speed it up (or at least that's the hope!).

WITH PhoneDuplicates AS
(SELECT c.Telephone, COUNT(*) as PhoneCount
   FROM dbo.Company AS c 
   GROUP BY c.Telephone
   HAVING COUNT(*) > 1
)
SELECT 
  (list of fields from company table)
FROM
  dbo.Company AS c
INNER JOIN
  PhoneDuplicates as PD ON PD.Telephone = c.Telephone

Marc

marc_s
A: 

Something you could try is

SELECT C1.*
FROM (select * from dbo.Company where iId_third_party_id IS NULL) AS C1 WITH(NOLOCK)
INNER JOIN (select * from dbo.Company where iId_third_party_id IS NULL) AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company

As this has helped us before.

Antony Koch
I don't think this would work because your sub-queries create a "AND", making only results that both have NULL to be returned when only one of them needs to be NULL for the join to succeed.
MaxiWheat
A: 

I doubt both your non-clustered indexes are going to be used (sTelephone, iId_third_party_id). Are you clustering on the primary key?

Look at the estimated execution plan.

Off the top of my head without seeing the plan, I would think about adding the iId_third_party_id to the nonclustered index on sTelephone and if you aren't clustering on the primary key, then add the iId_company to the index as well.

Note that there is also a potential for cross-join multiplication of results when there are more than two duplicates for a given phone number.

Cade Roux
A: 
With Temp as
(Select *
FROM dbo.Company as c
Where c.iId_third_party_id is NULL)

Select C1.*
From temp as C1 With (NoLock)
INNER JOIN Temp AS C2 
ON C2.sTelephone = C1.sTelephone AND C1.iId_company != C2.iId_company

Something like this might work

RisingCascade
+1  A: 

Performance-wise, what is the cardinality of your filtering columns?

Maybe having only

C1.iId_third_party_id IS NULL

gave a boost to performance because SQL could tell (from the statistics built on the index) that relatively few rows met the necessary criteria. When you added

(... OR C2.iId_third_party_id IS NULL)

perhaps SQL figured the resulting join would produce so many matching rows that it would not be efficient to use the index on that column.

Similarly, how many matching/duplicate phone numbers will there be? If this situation is very infrequent, I'd do something like marc_s's query (he beat me to it), as that would fly.

A lot depends on what the data looks like--how frequently or infrequently your filtering criteria appear. Analyze that, try and understand how it is and how it might change over time, and design your queries accordingly.

Philip Kelley
When the query actually ends, it returns between 10000 and 20000 rows
MaxiWheat
A: 

the speed increase you see when removing the OR part is because an OR automaticaly does an index scan instead of a seek. by unioning them together you do 2 seeks which is faster.

try finding the dupes using the the row_number technique:

;with cteDupes(RN, DupeID, DupeTelephone) as
(
SELECT  row_number() over(partition by sTelephone order by iId_company, sTelephone) RN,
        iId_company, sTelephone
FROM    dbo.Company 
WHERE   iId_third_party_id IS NULL
)
select * from cteDupes
where RN > 1

this will return only duped rows. the bonus of this is that you only get one table pass instead of two.

Mladen Prajdic