views:

1286

answers:

4

Hi, I have this query:

SELECT p.id, r.status, r.title
FROM page AS p
    INNER JOIN page_revision as r ON r.pageId = p.id AND (
     r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
     OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
    )

Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.

Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.

Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this cant be done without them.

Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and If no pages exist I still want the site returned.

Jack

Edit: I'm using MySQL

A: 

In MS SQL 2005+ and Oracle:

SELECT p.id, r.status, r.title
FROM (
  SELECT p.*, r,*,
         ROW_NUMBER() OVER (PARTITION BY p.pageId ORDER BY CASE WHEN p.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
  FROM page AS p, page_revision r
  WHERE r.id = p.pageId
  ) o
WHERE rn = 1

In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.

You'll need to create two indexes and a function that returns the last page revision to use those indexes:

CREATE INDEX ix_revision_page_status_id ON page_revision (page_id, id, status);

CREATE INDEX ix_revision_page_id (page_id, id);

CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
  DECLARE id INT;
  SELECT r_id
  INTO id
  FROM (
    SELECT r.id
    FROM page_revisions
    FORCE INDEX (ix_revision_page_status_id)
    WHERE page_id = input_id
      AND status = 'active'
    ORDER BY id DESC 
    LIMIT 1
    UNION ALL
    SELECT r.id
    FROM page_revisions
    FORCE INDEX (ix_revision_page_id)
    WHERE page_id = input_id
    ORDER BY id DESC 
    LIMIT 1
  ) o
  LIMIT 1;
  RETURN id;
END;

SELECT po.id, r.status, r.title
FROM (
  SELECT p.*, fn_get_last_revision(p.page_id) AS rev_id
  FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;

This will efficiently use index to get the last revision of the page.

P. S. If you will use codes for statuses and use 0 for active, you can get rid of the second index and simplify the query.

Quassnoi
Thanks, but I'm using MySQL, so I don't think that will work.
Jack Sleight
A: 

Your problem is a particular case of what is described in this question.

The best you can get using standard ANSI SQL seems to be:

SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id 
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)

Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.

Il-Bhima
That doesn't seem to have any preference for the latest "active" revision, as my original query did, or have I missed something?
Jack Sleight
+2  A: 

Maybe a little re-factoring is in order?

If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.

I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.

Greg
+2  A: 

If "active" is the first in alphabetical order you migt be able to reduce subqueries to:

SELECT p.id, r.status, r.title
FROM page AS p
    INNER JOIN page_revision as r ON r.pageId = p.id AND 
        r.id = (SELECT r2.id 
                FROM page_revision as r2 
                WHERE r2.pageId = r.pageId 
                ORDER BY r2.status, r2.id DESC
                LIMIT 1)

Otherwise you can replace ORDER BY line with

ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC

These all come from my assumptions on SQL Server, your mileage with MySQL may vary.

ssg
Ah ha, I'd done something similar using "FIELD(`r2`.`status`, 'active')", but this is even better, thanks. :-)
Jack Sleight
Works perfectly in MySQL.
Jack Sleight
You should still note that we are adding additional ordering. If you only have "active" and "inactive" this should be the same, but if you have more than that, the thing will sort by status first. To avoid that you might still wanna use ORDER BY CASE.
ssg