I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.

The following query works:

SELECT * 
FROM pupil_status 
WHERE (status_pupil_id, status_date) IN (
    SELECT status_pupil_id, MAX(status_date) 
    FROM pupil_status 
    WHERE status_date < NOW() -- to ensure we ignore the "future status"
    GROUP BY status_pupil_id );

In MySQL, the table is defined as follows:

CREATE TABLE IF NOT EXISTS `pupil_status` (
  `status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
  `status_category_id` int(10) unsigned NOT NULL, -- a foreign key
  `status_date` datetime NOT NULL, -- effective date/time of status change
  `status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
  `status_notes` text NOT NULL, -- notes detailing the reason for status change
  PRIMARY KEY (`status_id`),
  KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
  KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;

However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. That is perhaps acceptable now, but I'm worried about scalability as the table grows. The production system is likely to have over 10000 pupils, each with 15-20 statuses.

Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.

+1  A: 

Find out how long the query takes when you load the system with 10000 pupils, each with 15-20 statuses.

Only refactor if it takes too long.
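One way to get that test load is to generate it. Below is a minimal sketch in Python that produces random status rows as CSV. The function names, the category ids 1..10, the staff ids 1..50 and the two-year date spread are all assumptions made for illustration, not anything taken from the real system.

```python
import csv
import io
import random
from datetime import datetime, timedelta


def generate_statuses(n_pupils=10000, min_statuses=15, max_statuses=20, seed=0):
    """Yield (pupil_id, category_id, status_date, staff_id, notes) rows.

    Dates are spread over roughly a year either side of "now", so some
    statuses lie in the future, as described in the question.
    """
    rng = random.Random(seed)
    now = datetime.now().replace(microsecond=0)
    for pupil_id in range(1, n_pupils + 1):
        for _ in range(rng.randint(min_statuses, max_statuses)):
            offset = timedelta(seconds=rng.randint(-365 * 86400, 365 * 86400))
            yield (
                pupil_id,
                rng.randint(1, 10),  # hypothetical category ids 1..10
                (now + offset).strftime("%Y-%m-%d %H:%M:%S"),
                rng.randint(1, 50),  # hypothetical staff ids 1..50
                "generated test row",
            )


def write_csv(rows, out):
    """Write rows as CSV to a file-like object and return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Small demonstration: 5 pupils written to an in-memory buffer.
buffer = io.StringIO()
written = write_csv(generate_statuses(n_pupils=5), buffer)
print(written, "rows generated")
```

For the full load, write to a real file with the default n_pupils and import it, for example with MySQL's LOAD DATA INFILE and a column list of (status_pupil_id, status_category_id, status_date, status_staff_id, status_notes). Then time the query against that table.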

jeffo
@jeffo seems too sensible ;-) I'll generate some random data quickly and see what happens...
Philip
@jeffo -- a long, long time...
Philip
Joins, indexes, and in-memory caches are your friends then...
jeffo
+3  A: 

Here are a couple of things you could try:

1 Use an INNER JOIN on a derived table instead of the WHERE ... IN subquery

SELECT ps.* -- only the status columns, as in the original query
FROM pupil_status ps
INNER JOIN 
    (SELECT status_pupil_id, MAX(status_date) AS status_date -- alias so the join can reference it
    FROM pupil_status 
    WHERE status_date < NOW()
    GROUP BY status_pupil_id) x
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date;

2 Store the value of NOW() in a variable and use that in the query. I am not sure whether the DB engine optimizes NOW() into a single call, but if it doesn't, this might help a bit
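As a rough sketch of suggestion 2 from the application side, the timestamp can be computed once and passed to the join query as a bound parameter. Note the assumptions: SQLite (via Python's sqlite3) stands in for MySQL so the example runs anywhere, the table is cut down to four columns, and the sample rows are made up.

```python
import sqlite3
from datetime import datetime

# SQLite stands in for MySQL here (an assumption); with a MySQL driver the
# placeholder would typically be %s instead of ?.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pupil_status ("
    " status_id INTEGER PRIMARY KEY,"
    " status_pupil_id INTEGER NOT NULL,"
    " status_category_id INTEGER NOT NULL,"
    " status_date TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO pupil_status"
    " (status_pupil_id, status_category_id, status_date) VALUES (?, ?, ?)",
    [
        (1, 10, "2000-01-01 00:00:00"),  # past
        (1, 20, "2001-01-01 00:00:00"),  # most recent past status for pupil 1
        (1, 30, "2999-01-01 00:00:00"),  # future, must be ignored
        (2, 40, "2000-06-01 00:00:00"),  # only status for pupil 2
    ],
)

# Evaluate "now" once in the application and reuse it for the whole query.
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

current = conn.execute(
    "SELECT ps.status_pupil_id, ps.status_category_id"
    " FROM pupil_status ps"
    " INNER JOIN (SELECT status_pupil_id, MAX(status_date) AS status_date"
    "             FROM pupil_status"
    "             WHERE status_date < ?"
    "             GROUP BY status_pupil_id) x"
    " ON ps.status_pupil_id = x.status_pupil_id"
    " AND ps.status_date = x.status_date"
    " ORDER BY ps.status_pupil_id",
    (now,),
).fetchall()
print(current)
```

In MySQL itself the same idea would be a user variable set once (for example SET @now = NOW();) and referenced in the WHERE clause. Whether it makes any measurable difference is something to check in the query plan and timings.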

These are only suggestions, however. You will need to compare the query plans to see whether there is any appreciable improvement. Depending on how the indexes are used according to the query plan, robob's suggestion above could also come in handy.

InSane
@InSane thanks! I didn't realise there would be such an incredible difference between WHERE ... IN and INNER JOIN. With 200000 statuses for 10000 pupils, the join returns in 0.08 seconds, whereas my original query above takes in excess of 5 minutes (and then I get bored of waiting). I did not change the keys though, so I'm not sure that's necessary.
Philip