views:

237

answers:

9
+2  Q: 

Limit Per Criteria

Hi,

I have an articles table and a categories table. I want to fetch 7 articles for each category. Currently I have this but it's terrible slow on large tables so it's not really a solution:

SELECT id, 
       title, 
       categories_id, 
       body, 
       DATE_FORMAT(pubdate, "%d/%m/%y %H:%i") as pubdate   
FROM articles AS t 
WHERE ( 
    SELECT COUNT(*) 
    FROM articles 
    WHERE t.categories_id = categories_id 
      AND id< t.id AND publish = 1 
      AND expires > '2008-12-14 18:38:02' 
      AND pubdate <= '2008-12-14 18:38:02' 
    ) < 7 
ORDER BY categories_id DESC

Using explain, it shows me it's doing a join type ALL & REF. The select types are PRIMARY and DEPENDENT SUBQUERY .

Is there a better solution?

A: 

You have several options - some might result in performance issues but it depends on many factors.

You could split it into several queries. One query to read out all the categories:

SELECT categories_id FROM Categories

Then for each category, read out the top seven articles:

SELECT 
  id, 
  title, 
  ...etc.
FROM articles
where categories_id = 1

... and so on for each category.

This has the benefit of being a bit simpler to understand but the downside is it turns one query into 1 + (1* number of cats). Then again, you could put a limit on the number of categories so you have some element of control. Sometimes you find that 5 simple queries perform a lot better than 1 complicated one!

This kind of assumes you are calling the SQL from some code you have control of - is this the case?

Jennifer
I've never seen the case where decomposing a sql statement with a subquery into a loop with just the subquery has improved performance, unless there was some fixable fault in the first.
le dorfier
+2  A: 
  1. How big are the tables, and how slow is slow?

  2. What indexes are there on the tables?

  3. What is the entire information from EXPLAIN?

Also, the two datetime values are explicit, so it looks like this is either being generated from code generated by something composing it from other information. Is there another SQL query of some kind, which is executing this within a loop on a list?

It's not clear which 7 articles are being chosen - the most recent? By which date?

le dorfier
A: 

In testing I found that the Limit 7 doesn't work within subqueries in MySQL, please see Bill's suggestion which I verified that it works well.

SELECT id, 
       title, 
       categories_id, 
       body, 
       DATE_FORMAT(pubdate, "%d/%m/%y %H:%i") as pubdate   
FROM articles A INNER JOIN articles B ON B.categories_ID = A.Categories_ID
WHERE A.ID IN ( 
    SELECT ID
    FROM Articles  
    WHERE categories_id = A.categories_id 
      AND publish = 1 
      AND expires > '2008-12-14 18:38:02' 
      AND pubdate <= '2008-12-14 18:38:02' 
    LIMIT 7
    ORDER BY Categories_ID DESC) 
ORDER BY B.Categories_ID DESC
Turnkey
Only since the most recent versions has MySQL supported subqueries; and optimization isn't very good on them. So this is testable, but needs to be compared with other types of approaches.
le dorfier
Putting the LIMIT 7 into the subquery isn't going to limit the number of rows - it just limits the number of records in the subquery to match on. Better to SELECT DISTINCT in the subquery - then it will just look at one row per category, not 7
le dorfier
Good point, originally it was ambiguous whether he wanted the top 7 rows in each category or the catetories with less than 7, and has lately clarified it as the the latest 7 articles of each catetory.
Turnkey
I defer to the wisdom of Bill on this one whose suggested method is much better.
Turnkey
A: 

Thank you for your input :)

Jennifer, yes, that is the other solution I was looking at since I do have control over the code generating the query but I was wondering if there was a better option with a single query..

The indexes are:

PRIMARY KEY (id), KEY publish (publish), KEY pubdate (pubdate), KEY expires (expires), KEY categories_id (categories_id)

The full explain output is here : http://skitch.com/sofia/6xhk/subquery

I just noticed there was something missing in the query I posted - the idea is to get the latest 7 articles of each category , so pubdate DESC should be in the order clause .

A while ago it took 139 seconds in a table with 6000 articles more or less. As the nr of articles might grow this query's clearly taking too long.

Turnkey, I'm going to test your query and i'll get back with feedback but if anyone else has other ideas now that i've posted more info, please say so :) - dunno what's the best solution really: 1 query or 11 queries (I have around 10 categories and they shouldn't grow)

sofia
+1  A: 

So it looks like you're asking for those categories with fewer than 7 articles; so that's where the query should start --

SELECT categories_id,  COUNT(1)  
FROM articles  
WHERE publish = 1 
  AND expires > '2008-12-14 18:38:02' 
  AND pubdate <= '2008-12-14 18:38:02'
GROUP BY categories_id
HAVING COUNT(1) < 7

Then make that a subquery with:

SELECT 
    c.id, c.title, c.id, a.body,  
    DATEFORMAT(a.pubdate, "%d/%m/%y %H:%i") as pubdate  
FROM categories c  
JOIN articles a ON c.id = a.categories_id  
JOIN 
(  
    SELECT DISTINCT categories_id  
    FROM articles  
    WHERE publish = 1  
        AND expires > '2008-12-14 18:38:02'  
        AND pubdate <= '2008-12-14 18:38:02'  
    GROUP BY categories_id  
    HAVING COUNT(1) <= 7  
) AS j ON c.id = j.categories_id  
ORDER BY whatever

The next step is to limit the number of articles returned to 7 - I can deal with that next if this looks proper. (Try it as-is and see what EXPLAIN looks like.)

EDIT: Changed "< 7" to <= 7"

le dorfier
A: 

Turnkey - your query gives me the following error http://skitch.com/sofia/6tbh/subquery1

Le dorfier your's gives me 'Every derived table must have an alias' (http://skitch.com/sofia/6tnb/subquery2) ?

sofia
I revised that statement, adding "as j ON c.id = j.category_id". Sorry
le dorfier
I revised it but need to test it in MySQL, will try to get hold of one to test it on.
Turnkey
+4  A: 

Here's how I'd solve this problem:

SELECT a1.id, 
       a1.title, 
       a1.categories_id, 
       a1.body, 
       DATE_FORMAT(a1.pubdate, "%d/%m/%y %H:%i") as pubdate  
FROM articles AS a1
  LEFT OUTER JOIN articles AS a2
  ON (a1.categories_id = a2.categories_id AND 
     (a1.pubdate < a2.pubdate OR (a1.pubdate = a2.pubdate AND a1.id < a2.id)))
GROUP BY a1.id
HAVING COUNT(*) < 7;

A correlated subquery usually performs poorly, so this technique uses a join instead.

For a given article, search for articles (a2) that match the category of the current article under consideration (a1), and have a more recent date (or higher id in the case of a tie). If there are fewer than seven articles that meet that criteria, then the current one must be among the most recent of its category.

If you can rely on the unique id column having the same sort order as pubdate, then you can simplify the join because there will be no ties over a unique column:

  ON (a1.categories_id = a2.categories_id AND a1.id < a2.id)
Bill Karwin
Good suggestion, that looks like the right way to do it, should work in all cases and should be fast.
Turnkey
A: 

Although Bill's query might work a bit better on average, on a single run it took 230 secs. I didn't do a full test (several runs) but it's still too slow so i guess the better option would be to do 1 query per category fetching the latest 7 items - that looks like it's going to be faster than all other options.

sofia
Do you have indexes set up for the columns in the WHERE criteria?
Turnkey
A: 
sofia
Maybe change "COUNT(1) < 7" to "COUNT(1) <= 7"?
le dorfier
It's because the <7 means only the categories that have less than 7 articles which is not what you originally wanted.
Turnkey
It duplicates accurately what the query in the original posting does. I figured a comparison with the actual query was wanted, since that's what the timings came from.
le dorfier
I presume it's because this is part of the answer to the larger question: Show me this months' articles, but at least the last 7 articles regardless of date. This gives the set of articles requiring the second rule.
le dorfier