I have a subquery problem that is causing poor performance... I was thinking that the subquery could be re-written using a join, but I'm having a hard time wrapping my head around it.
The gist of the query is this: For a given combination of EmailAddress and Product, I need to get a list of the IDs that are NOT the latest.... these orders are going to be marked as 'obsolete' in the table which would leave only that latest order for a a given combination of EmailAddress and Product... (does that make sense?)
Table Definition
CREATE TABLE `sandbox`.`OrderHistoryTable` (
`id` INT( 11 ) NOT NULL AUTO_INCREMENT ,
`EmailAddress` VARCHAR( 100 ) NOT NULL ,
`Product` VARCHAR( 100 ) NOT NULL ,
`OrderDate` DATE NOT NULL ,
`rowlastupdated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
PRIMARY KEY ( `id` ) ,
KEY `EmailAddress` ( `EmailAddress` ) ,
KEY `Product` ( `Product` ) ,
KEY `OrderDate` ( `OrderDate` )
) ENGINE = MYISAM DEFAULT CHARSET = latin1;
Query
SELECT id
FROM
OrderHistoryTable AS EMP1
WHERE
OrderDate not in
(
Select max(OrderDate)
FROM OrderHistoryTable AS EMP2
WHERE
EMP1.EmailAddress = EMP2.EmailAddress
AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
)
Explanation of duplicate 'IN' statements
13 [email protected] ProductA 2010-10-01
15 [email protected] ProductB 2010-20-02
46 [email protected] ProductD 2010-20-03
57 [email protected] ProductC 2010-20-04
158 [email protected] ProductE 2010-20-05
206 [email protected] ProductB 2010-20-06
501 [email protected] ProductZ 2010-20-07
The results of my query should be | 13 | | 15 | | 46 | | 57 |
This is because, in the orders listed, those 4 have been 'superceded' by a newer order for a product in the same category. This 'category' contains prodcts A, B, C & D.
Order ids 158 and 501 show no other orders in their respective categories based on the query.
Final Query based off of accepted answer below: I ended up using the following query with no subquery and got about 3X performance (30 sec down from 90 sec). I also now have a separate 'groups' table where I can enumerate the group members instead of spelling them out in the query itself...
SELECT DISTINCT id, EmailAddress FROM (
SELECT a.id, a.EmailAddress, a.OrderDate
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
INNER JOIN groups g1 ON a.Product = g1.Product
INNER JOIN groups g2 ON b.Product = g2.Product
WHERE
g1.family = 'ProductGroupX'
AND g2.family = 'ProductGroupX'
GROUP BY a.id, a.OrderDate, b.OrderDate
HAVING a.OrderDate < MAX(b.OrderDate)
) dtX