tags:

views:

96

answers:

5

I have a subquery problem that is causing poor performance... I was thinking that the subquery could be re-written using a join, but I'm having a hard time wrapping my head around it.

The gist of the query is this: For a given combination of EmailAddress and Product, I need to get a list of the IDs that are NOT the latest.... these orders are going to be marked as 'obsolete' in the table which would leave only that latest order for a a given combination of EmailAddress and Product... (does that make sense?)

Table Definition

CREATE TABLE  `sandbox`.`OrderHistoryTable` (
 `id` INT( 11 ) NOT NULL AUTO_INCREMENT ,
 `EmailAddress` VARCHAR( 100 ) NOT NULL ,
 `Product` VARCHAR( 100 ) NOT NULL ,
 `OrderDate` DATE NOT NULL ,
 `rowlastupdated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
PRIMARY KEY (  `id` ) ,
KEY  `EmailAddress` (  `EmailAddress` ) ,
KEY  `Product` (  `Product` ) ,
KEY  `OrderDate` (  `OrderDate` )
) ENGINE = MYISAM DEFAULT CHARSET = latin1;

Query

SELECT id
FROM
OrderHistoryTable AS EMP1
WHERE
OrderDate not in 
   (
   Select max(OrderDate)
   FROM OrderHistoryTable AS EMP2
   WHERE 
       EMP1.EmailAddress =  EMP2.EmailAddress
   AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
   AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
   )

Explanation of duplicate 'IN' statements

13   [email protected]  ProductA  2010-10-01
15   [email protected]  ProductB  2010-20-02
46   [email protected]  ProductD  2010-20-03
57   [email protected]  ProductC  2010-20-04
158  [email protected]  ProductE  2010-20-05
206  [email protected]  ProductB  2010-20-06
501  [email protected]  ProductZ  2010-20-07

The results of my query should be | 13 | | 15 | | 46 | | 57 |

This is because, in the orders listed, those 4 have been 'superceded' by a newer order for a product in the same category. This 'category' contains prodcts A, B, C & D.

Order ids 158 and 501 show no other orders in their respective categories based on the query.

Final Query based off of accepted answer below: I ended up using the following query with no subquery and got about 3X performance (30 sec down from 90 sec). I also now have a separate 'groups' table where I can enumerate the group members instead of spelling them out in the query itself...

SELECT DISTINCT id, EmailAddress FROM (
  SELECT a.id, a.EmailAddress, a.OrderDate
  FROM OrderHistoryTable a
  INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
  INNER JOIN groups g1  ON  a.Product = g1.Product 
  INNER JOIN groups g2  ON  b.Product = g2.Product 
  WHERE 
        g1.family = 'ProductGroupX'
    AND g2.family = 'ProductGroupX'
  GROUP BY a.id, a.OrderDate, b.OrderDate
  HAVING  a.OrderDate < MAX(b.OrderDate)
) dtX
+4  A: 

Use:

   SELECT a.id
     FROM ORDERHISTORYTABLE AS a
LEFT JOIN (SELECT e.EmailAddress,
                  e.product,
                  MAX(OrderDate) AS max_date
             FROM OrderHistoryTable AS e
            WHERE e.Product IN ('ProductA','ProductB','ProductC','ProductD')
         GROUP BY e.EmailAddress) b ON b.emailaddress = a.emailaddress
                                   AND b.max_date = a.orderdate
                                   AND b.product = a.product
    WHERE x.emailaddress IS NULL
      AND a.Product IN ('ProductA','ProductB','ProductC','ProductD')
OMG Ponies
This looks good, but I'm getting "#1054 - Unknown column 'emp1.Product' in 'where clause'"
BrianAdkins
@ OMG Ponies:So my original query takes about 90 seconds to run (but I have to run it many times over again with different product sets)... I just tried your revised query and I killed the process at the 3 minute mark as it was marked as **DEAD** ... any ideas?
BrianAdkins
@BrianAdkins: I re-added the filtration in the derived table for the LEFT JOIN -- that should minimize the amount of processing, but I'd hoped to consolidate it. You've got separate indexes -- what about a covering index using emailaddress, product and orderdate columns?
OMG Ponies
A: 

Make view of this sub query and do join with view. It will multiply by 5 to ur query performance

seed_of_tree
+1  A: 

My MySQL is a bit rusty (I'm used to MSSQL), but here's my best guess. It might need a bit of tweaking in the GROUP BY and HAVING clauses. Also, I assumed from your duplicate IN statements that you want the Products to match in both tables. If this isn't the case, I'll adjust the query.

SELECT a.id
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b
    ON a.Product = b.Product AND
       a.Employee = b.Employee
WHERE a.Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY a.id, a.OrderDate, b.OrderDate, 
HAVING b.OrderDate < MAX(a.OrderDate)

Edit: removed extraneous AND.

jwiscarson