views:

520

answers:

1

Hi,

I feel like this question has probably been asked a thousand times already, so I apologize if it's been answered. And if so, can someone point me to the right posts/links?

What I'm trying to do is build a faceted navigation for my site. It uses MySQL and here's a rough sketch of the tables I'm using:

products:
- id
- title
- description
attributes:
- product_id
- name
- value
categories:
- id
- name
products_to_categories:
- product_id
- category_id

What I want to do is display a list of available attributes when you are in a category, allowing you to select one or more values for each of those attributes. To give you an example, look at this page from Office Depot: http://www.officedepot.com/a/browse/binders/N=5+2177/

So far I've used a lot of joins to filter on multiple attributes:

SELECT products.*, a_options.*
FROM products_to_categories AS pc, products,
attributes AS a_options,    /* list of attribute/value pairs I can continue to refine on */
attributes AS a_select1     /* first selected attribute */
attributes AS a_select2     /* second selected attribute */
...
WHERE pc.category_id = 1
AND products.id = pc.product_id
AND a_options.product_id = products.id
AND a_options.name != 'Color' AND a_options.name != 'Size'
AND a_select1.product_id = products.id
AND a_select1.name = 'Color' AND (a_select1.value = 'Blue' OR a_select1.value = 'Black')
AND a_select2.product_id = products.id
AND a_select2.name = 'Size' AND a_select2.value = '8.5 x 11'

Basically a_options will return all the attributes for those products that are a subset of the filters I've applied using a_select1 and a_select2. So if I use the Binders example from Office Depot, I want to show all available attributes after selecting Blue or Black for Color and "8.5 x 11" for the Size.

I then use PHP code to remove duplicates and arrange the resulting attributes into an array like this:

attributes[name1] = (val1, val2, val3, ...)
attributes[name2] = (val1, val2, val3, ...)

Is there a way I can speed up my query or write it more efficiently? I have setup indexes on the name and value in the attributes table (and also on all the ID numbers). But if someone selects a couple of attributes, then the query runs slow.

Thanks for your help in advance,
Sridhar

+2  A: 

"I then use PHP code to remove duplicates"

It will not scale then.

After I read http://www.amazon.com/Data-Warehouse-Toolkit-Techniques-Dimensional/dp/0471153370 I was rolling out facets & filtering mechanisms non stop.

The basic idea is you use a star schema..

You create a fact table that stores facts

customerid | dateregisteredid | datelastloginid
1 | 1 | 1
2 | 1 | 2

You use foreign keys into dimension tables that store attributes

date_registered
Id | weekday | weeknumber | year | month | month_year | daymonth | daymonthyear
1 | Wed      | 2            | 2009 | 2   |2-2009      | 4        | 4-2-2009

Then whichver date "paradigm" you are using, grab all the ids from that dimension table and

 select * from the fact table where the fact.dateregisteredid is IN( ... the ids from the date dimension table that represent your time period)

These "indexed views" of your data should reside in a seperate database, and a change to an object in production should queue that record for re-indexing in the analytics system. Large sites might batch their records at non-peak times to the stats reporting application always lags behind a few hours or days. I always try to keep it up to the second, if the architecture supports it.

If you are displaying rowcount previews, you might have quite some optimization or caching to implement as well.

Basically to sum it up, you copy data and denormalize. The technique goes by the name "data warehousing" or OLAP (online analytics processing).

There are better ways, using commercial databases like Oracle, but the star schema makes it available to anyone with an open source relational database and some time.

You should definitely read the toolkit but he discusses a lot of things that can save you considerable time. Like strategies for dealing with updated data, and retaining audit history in the reporting application. For every problem he outlines multiple solutions, each of which are applicable in different contexts.

It can scale up to millions of rows if you don't take the easy ways out and use a ton of needless joins.

Josh Ribakoff
It's ROLAP, to be precise (Relational OLAP) since data are still stored stored in relational structured, while OLAP means storing it in a multidimensional cube.
deadbeef