Let's say I have 3 tables (significant columns only)

  1. Category (catId key, parentCatId)
  2. Category_Hierarchy (catId key, parentTrail, catLevel)
  3. Product (prodId key, catId, createdOn)

The separate Category_Hierarchy table exists because I populate it from triggers on the Category table: a MySQL trigger can't populate columns of the same table it fires on if I want to use auto_increment values. This is irrelevant to the problem at hand. The two tables are 1:1 anyway.

Category table could be:

+-------+-------------+
| catId | parentCatId |
+-------+-------------+
|   1   | NULL        |
|   2   | 1           |
|   3   | 2           |
|   4   | 3           |
|   5   | 3           |
|   6   | 4           |
|  ...  | ...         |
+-------+-------------+

Category_Hierarchy

+-------+-------------+----------+
| catId | parentTrail | catLevel |
+-------+-------------+----------+
|   1   | 1/          | 0        |
|   2   | 1/2/        | 1        |
|   3   | 1/2/3/      | 2        |
|   4   | 1/2/3/4/    | 3        |
|   5   | 1/2/3/5/    | 3        |
|   6   | 1/2/3/4/6/  | 4        |
|  ...  | ...         | ...      |
+-------+-------------+----------+

Product

+--------+-------+---------------------+
| prodId | catId | createdOn           |
+--------+-------+---------------------+
| 1      | 4     | 2010-02-03 12:09:24 |
| 2      | 4     | 2010-02-03 12:09:29 |
| 3      | 3     | 2010-02-03 12:09:36 |
| 4      | 1     | 2010-02-03 12:09:39 |
| 5      | 3     | 2010-02-03 12:09:50 |
| ...    | ...   | ...                 |
+--------+-------+---------------------+

Category_Hierarchy makes it simple to get category subordinate trees like this:

select c.*
from Category c
    join Category_Hierarchy h
    on (h.catId = c.catId)
where h.parentTrail like '1/2/3/%'

This returns the complete subtree of category 3 (which is below 2, which is below the root category 1), including the subtree's root node. Excluding the root node takes just one more where condition.
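To make the "one more where condition" concrete, here is a minimal sketch that loads the sample rows above into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL only so the example can run anywhere; the `||` concatenation operator and the helper name `subtree_ids` are assumptions of this sketch, not part of the original schema or code.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (catId INTEGER PRIMARY KEY, parentCatId INTEGER);
    CREATE TABLE Category_Hierarchy (
        catId INTEGER PRIMARY KEY, parentTrail TEXT, catLevel INTEGER);
    INSERT INTO Category VALUES (1, NULL), (2, 1), (3, 2), (4, 3), (5, 3), (6, 4);
    INSERT INTO Category_Hierarchy VALUES
        (1, '1/', 0), (2, '1/2/', 1), (3, '1/2/3/', 2),
        (4, '1/2/3/4/', 3), (5, '1/2/3/5/', 3), (6, '1/2/3/4/6/', 4);
""")

SUBTREE_SQL = """
    SELECT c.catId
    FROM Category c
    JOIN Category_Hierarchy h ON h.catId = c.catId
    WHERE h.parentTrail LIKE ? || '%'
"""


def subtree_ids(trail, include_root=True):
    """Return the catIds whose parentTrail starts with `trail` (hypothetical helper)."""
    sql = SUBTREE_SQL
    params = [trail]
    if not include_root:
        # The extra condition: drop the node whose trail equals the prefix itself.
        sql += " AND h.parentTrail <> ?"
        params.append(trail)
    sql += " ORDER BY c.catId"
    return [row[0] for row in conn.execute(sql, params)]


with_root = subtree_ids("1/2/3/")
without_root = subtree_ids("1/2/3/", include_root=False)
print(with_root)     # [3, 4, 5, 6]
print(without_root)  # [4, 5, 6]
```

In MySQL the same extra condition would be `and h.parentTrail <> '1/2/3/'` appended to the query above.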

The problem

I would like to write a stored procedure:

create procedure GetLatestProductsFromSubCategories(in catId int)
begin
    /* return 10 latest products from each */
    /* catId subcategory subordinate tree  */
end;

This means that if a certain category had 3 direct subcategories (with any number of nodes underneath), I would get 30 results (10 from each subtree). If it had 5 subcategories, I'd get 50 results.

What would be the best/fastest/most efficient way to do this? If possible, I'd like to avoid both cursors and prepared statements (unless they work faster than any other solution), because this would be one of the most frequent calls to the DB.

Edit

Since a picture is worth 1000 words, I'll try to explain what I want with an image. The image below shows the category tree. Each of these nodes can have an arbitrary number of products related to it. Products are not included in the picture.

[image: category tree]

So if I'd execute this call:

call GetLatestProductsFromSubCategories(1);

I'd like to effectively get 30 products:

  • 10 latest products from the whole orange subtree
  • 10 latest products from the whole blue subtree and
  • 10 latest products from the whole green subtree

I don't want to get the 10 latest products from each node under the catId=1 node, which would mean 320 products.

+2  A: 

Final Solution

This solution has O(n) performance:

CREATE PROCEDURE foo(IN in_catId INT)
BEGIN
  DECLARE done BOOLEAN DEFAULT FALSE;
  DECLARE first_iteration BOOLEAN DEFAULT TRUE;
  DECLARE current VARCHAR(255);

  DECLARE categories CURSOR FOR
  SELECT parentTrail 
  FROM category 
  JOIN category_hierarchy USING (catId)
  WHERE parentCatId = in_catId;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = TRUE;

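  -- Accumulates one "(SELECT ... LIMIT 10)" branch per direct child category.
  SET @query := '';

  OPEN categories;

  category_loop: LOOP
    FETCH categories INTO current;
    IF `done` THEN LEAVE category_loop; END IF;

    -- Join the branches with UNION ALL, skipping the separator before the first one.
    IF first_iteration = TRUE THEN
      SET first_iteration = FALSE;
    ELSE
      SET @query = CONCAT(@query, " UNION ALL ");
    END IF;

    -- Each branch returns the 10 newest products in the whole subtree under this child's parentTrail.
    SET @query = CONCAT(@query, "(SELECT product.* FROM product JOIN category_hierarchy USING (catId) WHERE parentTrail LIKE CONCAT('",current,"','%') ORDER BY createdOn DESC LIMIT 10)");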

  END LOOP category_loop;
  CLOSE categories;

  IF @query <> '' THEN
    PREPARE stmt FROM @query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
  END IF;

END

Edit

Following the latest clarification, the only change needed was to simplify the categories cursor query.

Note: Size the VARCHAR(255) in the declaration of `current` (line 5 of the procedure) to match your parentTrail column.
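The same idea (one "latest N" branch per direct child, each branch covering that child's whole subtree, all branches glued together with UNION ALL) can be sketched outside MySQL. The following is an illustration only: it uses Python's sqlite3 module with made-up sample data, the function name `latest_per_child_subtree` is hypothetical, and the limit is 2 instead of 10 to keep the data small. SQLite does not accept parenthesized branches with their own ORDER BY, so each branch is wrapped in `SELECT * FROM (...)`, and values are bound as parameters rather than concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (catId INTEGER PRIMARY KEY, parentCatId INTEGER);
    CREATE TABLE Category_Hierarchy (
        catId INTEGER PRIMARY KEY, parentTrail TEXT, catLevel INTEGER);
    CREATE TABLE Product (prodId INTEGER PRIMARY KEY, catId INTEGER, createdOn TEXT);

    -- Sample tree (assumed): 1 is the root, 2 and 3 are its children, 4 is below 2.
    INSERT INTO Category VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
    INSERT INTO Category_Hierarchy VALUES
        (1, '1/', 0), (2, '1/2/', 1), (3, '1/3/', 1), (4, '1/2/4/', 2);
    INSERT INTO Product VALUES
        (1, 2, '2010-02-03 12:00:01'), (2, 4, '2010-02-03 12:00:02'),
        (3, 4, '2010-02-03 12:00:03'), (4, 3, '2010-02-03 12:00:04'),
        (5, 1, '2010-02-03 12:00:05'), (6, 3, '2010-02-03 12:00:06'),
        (7, 3, '2010-02-03 12:00:07');
""")

# One branch: the newest `limit` products anywhere under one child's trail.
BRANCH_SQL = """
    SELECT * FROM (
        SELECT ? AS subtreeRoot, p.prodId
        FROM Product p
        JOIN Category_Hierarchy h ON h.catId = p.catId
        WHERE h.parentTrail LIKE ? || '%'
        ORDER BY p.createdOn DESC
        LIMIT ?
    )
"""


def latest_per_child_subtree(conn, cat_id, limit):
    """Map each direct child of cat_id to the newest products in its subtree."""
    children = conn.execute(
        """SELECT c.catId, h.parentTrail
           FROM Category c
           JOIN Category_Hierarchy h ON h.catId = c.catId
           WHERE c.parentCatId = ?
           ORDER BY c.catId""",
        (cat_id,),
    ).fetchall()
    if not children:
        return {}

    # Same shape as the dynamic query in the procedure: one branch per child.
    sql = " UNION ALL ".join([BRANCH_SQL] * len(children))
    params = []
    for child_id, trail in children:
        params += [child_id, trail, limit]

    result = {child_id: [] for child_id, _ in children}
    for subtree_root, prod_id in conn.execute(sql, params):
        result[subtree_root].append(prod_id)
    return result


latest = latest_per_child_subtree(conn, 1, 2)
print(latest)
```

With this data, the subtree of category 2 contributes products 3 and 2 (both attached to the deeper category 4), the subtree of category 3 contributes products 7 and 6, and product 5, which is attached directly to the root, appears in neither group.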

hobodave
Unfortunately both of your solutions return products related to categories that are immediate children of a selected category. You're not including subordinate trees and products related to those...
Robert Koritnik
... Your first solution seems simple and **very** clever. Unfortunately, you've pointed out its complexity limitations. The second one uses two things: a cursor on a rather limited set of records (which may not be too problematic in this case) and prepared statements whose execution plan can't be cached by MySQL (if it does something similar to MS SQL). But it would return the expected results for immediate child categories.
Robert Koritnik
@hobodave: Just for the sake of clarity and cleverness you could include the first solution as well. Mark it as obsolete or something... But it may be helpful to someone else, because I find it quite clever.
Robert Koritnik
@Robert: added it back.
hobodave
@Robert: Please see the documentation re: how the query cache operates: http://dev.mysql.com/doc/refman/5.1/en/query-cache-operation.html My solution _will_ be cached depending on your version.
hobodave
@hobodave: I just read that mysql can also cache execution plan for certain prepared statements, which makes it similar to regular stored procedures, doesn't it?
Robert Koritnik
@Robert: A stored procedure and a prepared statement are different. In 5.0 prepared statements are _never_ cached. In some versions of 5.1 they are. Queries within a sproc are cached based on the rules shown on the above referenced page.
hobodave
@hobodave: shouldn't you also call `deallocate|drop prepare` at the end, or does that prevent the statement from being cached?
Robert Koritnik
@hobodave: Your solution 1 is very close, but not yet there. You are preparing a statement with as many `select ... limit 10` unions as there are categories in the whole `in_catId` category subtree. It should instead prepare just as many selects as there are immediate child categories of `in_catId`, and each select should return 10 products for that child category's subtree. Do you know what I mean? E.g. if you clicked on C: (root) in Windows Explorer and there were 2 child folders on the root, it should return 20 files altogether: the 10 latest files from each folder subtree.
Robert Koritnik
Could solution 1 be made to return products from all sub-nodes, not just immediate children, by replacing the AND c.parentCatId = in_catId with something of the form AND c.parentCatId like in_catID . '%'? (Forgive my rusty SQL syntax.)
Al Crowley
@Robert: What?! I explicitly asked you, "Would this only return products for the immediate subcategories? Or would it include the children of children ad infinitum?" You responded that it would return "the latter, ad infinitum". This latest comment seems to completely contradict that.
hobodave
@Robert: From where I stand either my original solution, or my updated solution answer your question as posed. If they don't, I cannot discern what your intent is from your last comment. Please update your question to ask exactly what you mean. I suggest providing sample datasets and expected output with an explanation.
hobodave
@hobodave: You said either immediate children or whole subtrees. I said the latter. So whole subtrees. I apologise if you misunderstood me. I thought it was self-evident from the question already. You've helped me already, because I was able to speed up my query more than tenfold... Maybe even 100x better performance.
Robert Koritnik
@hobodave: The thing is that whole-subtree product listings should be grouped together into the 10 latest products for each subtree. That's why I said your solution is close but not yet there. So if `in_catId` has 3 subcategories and each of them has 2 more, your solution would return 90 results: 10 from each category under `in_catId`. It should instead return just 30: 10 from the first subcategory subtree (1 sub + 2 subsub categories), 10 from the second and 10 from the third. Have I made it clear this time? If the tree has more levels, this becomes even more complex.
Robert Koritnik
@Robert: No, it is not clear. Please provide sample data and expected output.
hobodave
@hobodave: Check my **edit** section of the original question. I attached an image that should explain it better than plain data and words.
Robert Koritnik
@Al Crowley: Not exactly. It could help by checking like against `parentTrail`. The problem is not getting subtrees but effectively querying them.
Robert Koritnik
@hobodave: I don't know if my words and images are really that unclear, but it seems your SQL query only gets the 10 latest products **directly** attached to categories 2, 3 and 4 respectively (the same as it did in the beginning). It should return the 10 latest products from **each subtree**. So if node 16 had 5 products and node 11 had 5 products (as per the category tree image), the query related to node 2 should return all 10, even though they are not directly attached to category 2. They are still related to node 2, because all those products are **in the subtree of node 2** (under whatever node).
Robert Koritnik
@Robert: Crap you're right. I had it pictured in my mind that way, and something altogether different came out :).
hobodave
@hobodave: OK, this does it now. I had already solved it earlier, but didn't want to mess with your answer.
Robert Koritnik
I up-voted it as well since you've put so much effort into this answer. Well done.
Robert Koritnik
@Robert: Thanks. What was your solution? The same?
hobodave