views:

62

answers:

4

Hi,

I have a performance problem on a query.

First table is a Customer table which has millions records in it. Customer table has a column of email address and some other information about customer.

Second table is a CommunicationInfo table which contains just Email addresses.

And What I want in here is; how many times the email address in CommunicationInfo table repeats in Customers table. What could be the the most performer query.

The basic query that I can explain this situation is;

Select ci.Email, count(*) from Customer c left join 
CommunicationInfo ci on c.Email1 = ci.Email or c.Email2 = ci.Email 
Group by ci.Email

But sure, it takes about 5, 6 minutes in execution.

Thanks in Advance.

+3  A: 

this query is about as good as it gets if you have an index on Customer.Email and another on CommunicationInfo.Email

Select
    c.Email, count(*)
    from Customer c
        left join CommunicationInfo  ci on c.Email1 = ci.Email
        left join CommunicationInfo ci2 on c.Email2 = ci2.Email
    Group by c.Email
KM
Key is Index. Make sure that email column is indexed properly.
Raja
If `ci.Email` is a `PRIMARY KEY`, this query is a synonym for `SELECT email, COUNT(*) FROM customer`. If it's not, this query is a messy cartesian join.
Quassnoi
+1  A: 

Using the OR condition robs the optimizer of opportunity to use HASH JOIN or MERGE JOIN.

Use this:

SELECT  ci.Email, SUM(cnt)
FROM    (
        SELECT  ci.Email, COUNT(c.Email) AS cnt
        FROM    CommunicationInfo ci
        LEFT JOIN
                Customer c
        ON      c.Email1 = ci.Email
        GROUP BY
                ci.Email
        UNION ALL
        SELECT  ci.Email, COUNT(c.Email) AS cnt
        FROM    CommunicationInfo ci
        LEFT JOIN
                Customer c
        ON      c.Email2 = ci.Email
        GROUP BY
                ci.Email
        ) q2
GROUP BY
        ci.Email

or this:

SELECT  ci.Email, COUNT(*)
FROM    CommunicationInfo ci
LEFT JOIN
        (
        SELECT  Email1 AS email
        FROM    Customer c
        UNION ALL
        SELECT  Email2
        FROM    Customer
        ) q
ON      q.Email = ci.Email
GROUP BY
        ci.Email

Make sure that you have indexes on Customer(Email) and Customer(Email2)

The first query will be more efficient if your emails are mostly not filled, the second one — if most emails are filled.

Quassnoi
I have a feeling that since there are millions of records in the Customer table, hitting IO twice against wouldn't necessarily make it run faster, but the thought crossed my mind as well.
Jeremy
@Jeremy: as long as `Email` and `Email2` are indexed, this will just get all keys from two indexes in a sequential fashion. If all or almost all fields are filled and have a match, the same thing would need to be done, only in a less efficient nested loops (random read) fashion.
Quassnoi
Ah yes, you mention that indexes need to be on the columns. My bad.
Jeremy
+1  A: 

You mention:

And What I want in here is; how many times the email address in CommunicationInfo table repeats in Customers table. What could be the the most performer query.

To me, that sounds like you could easily use an INNER JOIN - this would most likely be a lot faster, since it will limit the search scope to just those customers who really do have an e-mail - anyone who doesn't have an e-mail at all (and thus a count(*) = 0) will not even be looked at - that might make a big difference even just in the number of rows SQL Server has to count and group.

So try this:

SELECT 
   ci.Email, COUNT(*) 
FROM 
   dbo.Customer c 
INNER JOIN dbo.CommunicationInfo ci 
   ON c.Email1 = ci.Email OR c.Email2 = ci.Email  
GROUP BY
   ci.Email

How does that perform in your case??

marc_s
It really takes so much time to execute.
yapiskan
A: 

Depending on your environment there may not be much you can do to optimize this.

A couple of questions:

  1. How many records in CommunicationInfo?
  2. How often do you really need to run this query? Is it a one time analysis, or are multiple people going to be running this every 10 minutes?
  3. Are the fields indexed? I'll make a guess that neither Email1 nor Email2 field is indexed. However, I wouldn't suggest adding an index without taking the balance of the whole system into consideration.
  4. Why are you using a left join? Do you really need EVERYTHING from the Customer table? You're counting, so no harm in doing an INNER JOIN.

Suggestions:

  1. Run the query through the Query Optimization wizard to see if there is anything SQL Server would recommend.
  2. An extreme suggestion would be to dump the Email1 and Email2 columns into a temp table and join to that. I've seen queries run slowly because of a large amount of stress on a particular table, so sometimes copying the records into a temp table is faster, but this technique is very dependent on how much memory there is, how fast IO is, and the amount of stress on a particular table.
Jeremy