views:

48

answers:

1

Using MySQL

So i'm trying to improve upon a query I've written. My current query works, but i feel like i could be more efficient

Essentially, I have a table that lists 'who talks to who and how much'.

The records look like this:

email name status count prod_ref (I'll post an example set with example output at the end of the post) PK = email and name name is an internal email address specific to the organization that runs this. (IE sue@innatech becomes just sue)

What i want initially is a query that will return the record that has the highest count for each email address.

The best i could come up with here is:

SELECT email, name
FROM email_tbl AS a INNER JOIN
     (SELECT email, MAX(count) cnt
      FROM email_tbl
      GROUP BY email) AS b ON a.email = b.email AND a.count = r.cnt

This seems fairly optimized. I can then further filter my results by adding a HAVING clause to the subquery to specify things like SUM(count) > 20 ect.

The part that makes it nasty is the status field. (I know, there's some data-duplication problems that break proper ER styling, but for certain reasons, i'm only allowed one table).

The status field describes if a particular email should be ignored. Theoretically, all records for an email should have identical fields, but i want to account for the case where this isnt true. Basically, if there exists a record with a status of 'ignore', i want to throw out the email that is on that record.

The only way I could think to do this was another sub query, by adding to the end

WHERE a.email NOT IN 
   (SELECT DISTINCT email
    FROM email_tbl
    WHERE status = 'ignore')

It works... but I'm always in the market to improve my tools, so i want to know if there's a better way to accomplish this.

Appendix A Example set

email_tbl

[email protected];phil;good;12
[email protected];sue;good;5
-----------------------
[email protected];phil;good;2
[email protected];sue;good;2
[email protected];fred;good;8
-----------------------
[email protected];phil;good;5
[email protected];sue;ignore;4

would return

[email protected];phil
[email protected];fred
+1  A: 

Just a couple of off the cuff queries, so you'll have to test for performance:

Using NOT EXISTS instead of IN should at least be faster than what you have as it can stop processing the subquery as soon as it finds a match. It's correlated though, so you'll have to test.

SELECT
    email,
    name
FROM
    Email_Tbl AS T1
INNER JOIN
(
    SELECT email, MAX(count) cnt
    FROM email_tbl
    GROUP BY email
) AS SQ ON
    SQ.email = T1.email AND
    SQ.cnt = T1.count
WHERE
    NOT EXISTS
    (
        SELECT *
        FROM Email_Tbl T2
        WHERE T2.email = T1.email
          AND T2.status = 'Ignore'
    )

SELECT
    email,
    name
FROM
    Email_Tbl T1
LEFT OUTER JOIN Email_Tbl T2 ON
    T2.email = T1.email AND
    (
        T2.count > T1.count OR
        T2.status = 'Ignore'
    )
WHERE
    T2.email IS NULL

SELECT
    email,
    name
FROM
    Email_Tbl T1
LEFT OUTER JOIN Email_Tbl T2 ON
    T2.email = T1.email AND
    T2.count > T1.count
LEFT OUTER JOIN Email_Tbl T3 ON
    T3.email = T1.email AND
    T3.status = 'Ignore'
WHERE
    T2.email IS NULL OR
    T3.email IS NULL
Tom H.