Hi, we have a huge table of companies (17 million entries) for which we want to find duplicates according to search critera (based on phone number). The query runs very slow (5 minutes+)
Here is a simplified version of the query, but the problem is the same :
SELECT C1.*
FROM dbo.Company AS C1 WITH(NOLOCK)
INNER JOIN dbo.Company AS C2 ON C2.sTelephone = C1.sTelephone
AND C1.iId_company != C2.iId_company
AND (C1.iId_third_party_id IS NULL OR
C2.iId_third_party_id IS NULL)
Columns explanation :
- iId_company : Primary key, integer auto increment
- sTelephone : Phone number of the company, varchar with non-clustered index on it
- iId_third_party_id : ID from a third party provider, may be null when users insert themself new companies (for that we want to find duplicates), integer with a non-clustered index on it too.
What we what is the companies with the same phone number, but different primary keys (duplicates), and also that one part doesn't have a third party ID (which tells us that a end user inserted it.
Now, I tried some things but gave me no clue :
- When removing one side in the OR clause, only remaining C1.iId_third_party_id IS NULL gives a HUGE boost, the query takes like less than 5 seconds
- When removing completely the condition in parenthesis with the OR clause, the query then comes slow again (1 minute +), but I think it's only because the data set to return in very huge.
I ended up making UNION to combine both queries (each with its part on the OR condition) but I would like to understand why there is so much difference when using that OR in the condition.