views:

90

answers:

3

In our Ruby on Rails project we have a lot of categorization criteria for recipes, such as cook method, occasion etc. Every recipe belongs to one or several of these categories. When someone starts browsing for recipes, he/she can narrow down to a set of particular categories. Then we need to calculate the number of recipes in all categories accessible from this set ("accessible" means there are recipes in that category that also belong to selected categories). This is similar to how Amazon search works: someone enters 'Software' and there is a menu on the left which says "Books (200)", "Movies (300)" etc, so user can go deeper by clicking on these links.

Right now we've implemented it roughly like that:

  1. Build a set of selected categories from URL;
  2. Perform a query that fetches category ids from all recipes that fall into currently selected criteria;
  3. Build the index which maps all category ids to counts of recipes, and render only those that have non-zero counters;
  4. Store this index in memcached for 24 hours, so we only calculate it once per day for a particular page.

My concern is that if there is a cache miss, building index can take a lot of time. Maybe you have any suggestions how to solve this problem or improve current solution?

A: 

Indexing every day is not very clean. Why don't you index it, when you insert or update the data set?

insert the data set (like recipes)

  • start a thread, which adds the content to the index

  • if a timeout (like 1 second) occurs on the thread (High load!), stop it

daily:

  • save the current index to disk

  • update the whole index

  • if it fails, recover the saved index from disk

  • else read the the index to memcache

Beffa
A: 

Hello Oleg,

You didn't come with any estimations on the number of categories/products, but i will assume there are lots of them :)

If i would want performance, here is my approach: (i know, it's crazy :) )

  • for each category, keep in memcache a bit vector, meaning: the n-th bit is 1 if the product with id n belongs to that category

Let me give you an example: if products 1, 7, 9 and 10 are in category A and 1,6,9 are in category B and 1, 9, 11 are in C then:

  • A is 01000001 01100000
  • B is 01000010 01000000
  • C is 01000000 01010000

When you want to compute the intersection of those sets, just make a bitwise AND between your sets and you will have your result.

The result is:

  • RESULT = A AND B AND C = 01000000 01000000

If you want to compute for each category, just make another CATEGORY AND RESULT

Remarks:

  • don't forget to recompute those vectors on changing something in the DB
  • this is very fast if you plan to intersect a lot of categories
  • for each category, you will have to store a vector larger than TOTAL_NR_OF_PRODUCTS/8
Vlad Zloteanu
If you decide to do it this way, you can save some time and coding by doing it as a bloom filter: http://www.rubyinside.com/bloom-filters-a-powerful-tool-599.html (But I still think the problem itself is overkill and probably an unnecessary feature.)
SFEley
+1  A: 

What you're describing is a really bad combinatorics problem: for every selected category, iterate every recipe, then iterate categories for that recipe and then return a recipe count for that category. Even with optimized SQL you're talking about nested subselects, and logically this cannot be done in less than exponential time. (Meaning this is going to really hurt when you get a lot of recipes.) And with the number of possible combinations being equal to (categories)^2, caching gets more and more impractical too.

Are you sure you have to do it this way? You're wrong about Amazon, BTW; they don't have "crossover category views" like this. They show counts of search hits, which is easy with a search index. Putting in "Software" in the search box isn't treating software as a category; it's treating it as a keyword.

If no one is demanding this feature, I'd suggest simplifying it. On your category filter view, just show all the recipes that match. On each recipe page, you can show the sidebar list of all categories that that recipe is in, and counts for those categories if you feel like it. (Which can easily be cached as an attribute in the Categories model, and retrieved with eager loading when you bring up the recipe.)

If you do have to do this for some reason -- the Powers That Be demand it under the mistaken impression that users really want to see categories they didn't filter on -- then at least do it with SQL. Nested subselects do hurt and will chew up your database's memory, but they'll be faster than doing it in Ruby. Also, there are Rails plugins that will alter the behavior of caching, so that you show the expired result on the current hit and then regenerate the cache for the next hit.

But I'd seriously suggest tracking clicks and determining if anyone uses this before putting in more work.

SFEley