views:

108

answers:

5

I'm trying to write code to pull a list of product items from a SQL Server database and display the results on a webpage.

A requirement of the project is that a list of categories is displayed at the right-hand side of the page as a list of checkboxes (all categories selected by default), and a user can uncheck categories and re-query the database to view products in only the categories they want.

Here's where it starts to get a bit hairy.

Each product can be assigned to multiple categories using a product categories table as below...

Product table
[product_id](PK),[product_name],[product_price],[isEnabled],etc...

Category table
[CategoryID](PK),[CategoryName]

ProductCategory table

[id](PK),[CategoryID](FK),[ProductID](FK)

I need to select a list of products that match a set of category IDs passed to my stored procedure, where the products have multiple assigned categories.

The category IDs are passed to the proc as a comma-delimited varchar, e.g. ( 3,5,8,12 )

The SQL breaks this varchar value into a resultset in a temp table for processing.

How would I go about writing this query?

A: 

This should do. You don't have to break up the comma-delimited category IDs.

select distinct p.* 
from product p, productcategory pc
where p.product_id = pc.productid
and pc.categoryid in ( place your comma delimited category ids here)

This will give the products that are in any of the passed-in category IDs, i.e., as per JNK's comment, it's an OR, not an ALL. Please specify if you want an AND, i.e., the product should be selected only if it is in ALL the categories specified in the comma-separated list.
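
Note that a varchar parameter cannot be dropped into IN (...) as-is: the list either has to be written into the statement text or split into rows first. A minimal sketch of the second option, assuming SQL Server 2016 or later (compatibility level 130+), where the built-in STRING_SPLIT function is available. The variable name @CategoryIds is hypothetical; table and column names follow the query above.

```sql
-- @CategoryIds is a hypothetical variable holding the comma-delimited list.
DECLARE @CategoryIds varchar(200) = '3,5,8,12';

SELECT DISTINCT p.*
FROM   product p
JOIN   productcategory pc ON pc.productid = p.product_id
WHERE  pc.categoryid IN (
           -- STRING_SPLIT returns one row per element in a column named "value".
           SELECT CAST(s.value AS int)
           FROM   STRING_SPLIT(@CategoryIds, ',') AS s
       );
```

On older versions the same shape works with the temp table the question already builds in place of the STRING_SPLIT call.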

Faisal Feroz
The problem with placing the DISTINCT so high in the execution tree is that it will have to sort potentially quite a lot of products to get the distinct ones, and this sort is (very) costly. It is better to get the distinct product IDs first (i.e. sort only IDs) and then fetch the rest of the product info.
Remus Rusanu
Using DISTINCT as a band-aid on the need for a semi-join: not so great.
Emtucifor
A: 

This should be fairly close to what you are looking for

SELECT product.*
FROM   product
JOIN   ProductCategory ON ProductCategory.ProductID = Product.product_id
JOIN   #my_temp ON #my_temp.category_id = ProductCategory.CategoryID

EDIT

As noted in the comments, this will produce duplicates for products appearing in multiple categories. To correct this, specify DISTINCT before the column list. I have included all product columns (product.*) as I do not know which columns you are looking for, but you should probably change that to the specific columns you want.

Steve Weet
If a product belongs to multiple categories this query will produce duplicates.
Remus Rusanu
Cheers Steve, worked a treat
carrot_programmer_3
@Remus. Indeed it does. Edited to reflect that
Steve Weet
Using DISTINCT as a band-aid on the need for a semi-join: not so great.
Emtucifor
A: 

If you need anything other than product_id from products, you can write something like this (adding the extra fields that you need):

SELECT distinct(p.product_id)
FROM product_table p
JOIN productcategory_table pc
ON p.product_id=pc.product_id
WHERE pc.category_id in (3,5,8,12);

On the other hand, if you really need just the product_id, you can simply select it from productcategory_table:

SELECT distinct(product_id)
FROM productcategory_table
WHERE category_id in (3,5,8,12);
Daniel Lenkes
Same as Steve's query, this produces duplicates for products in multiple categories.
Remus Rusanu
You're right, fixed, thx
Daniel Lenkes
Using DISTINCT as a band-aid on the need for a semi-join: not so great.
Emtucifor
+2  A: 

One problem is passing the array or list of selected categories into the server. The subject was covered at length by Erland Sommarskog in the series of articles Arrays and Lists in SQL Server. Passing the list as a comma-separated string and building a temp table is one option. There are alternatives, like using XML, a table-valued parameter (in SQL Server 2008), or a table @variable instead of a #temp table. The pros and cons of each are covered in the article(s) I linked.
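
As an illustration of the table-valued parameter option, here is a minimal sketch for SQL Server 2008 or later. The type name dbo.CategoryIdList and the procedure name dbo.GetProductsByCategories are hypothetical, and the table and column names (Products, ProductCategories, ProductID, CategoryID) are assumed to match the queries in this answer rather than the exact schema in the question.

```sql
-- Hypothetical table type that carries the selected category IDs.
CREATE TYPE dbo.CategoryIdList AS TABLE (CategoryID int NOT NULL PRIMARY KEY);
GO

-- Hypothetical procedure; a table-valued parameter must be declared READONLY.
CREATE PROCEDURE dbo.GetProductsByCategories
    @SelectedCategories dbo.CategoryIdList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Products that belong to at least one of the selected categories.
    SELECT p.*
    FROM   Products p
    WHERE  EXISTS (
               SELECT 1
               FROM   ProductCategories pc
               JOIN   @SelectedCategories sc ON sc.CategoryID = pc.CategoryID
               WHERE  pc.ProductID = p.ProductID
           );
END
GO

-- Calling it from T-SQL: fill a variable of the table type and pass it in.
DECLARE @cats dbo.CategoryIdList;
INSERT INTO @cats (CategoryID) VALUES (3), (5), (8), (12);
EXEC dbo.GetProductsByCategories @SelectedCategories = @cats;
```

With this approach no string splitting happens on the server; the client sends the IDs already as rows.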

Now on how to retrieve the products. First things first: if all categories are selected then use a different query that simply retrieves all products w/o bothering with categories at all. This will save a lot of performance and considering that all users will probably first see a page w/o any category unselected, the saving can be significant.
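
The shortcut for the "everything selected" case could look roughly like the sketch below. It assumes a Categories table (name assumed) and that #selectedCategories holds distinct, valid category IDs, so comparing the two row counts is enough to detect that nothing was unchecked.

```sql
-- Assumption: #selectedCategories contains distinct IDs that all exist in Categories.
IF (SELECT COUNT(*) FROM #selectedCategories) = (SELECT COUNT(*) FROM Categories)
BEGIN
    -- Every category is selected: no need to touch the category tables at all.
    SELECT p.*
    FROM   Products p
    WHERE  p.IsEnabled = 1;
END
ELSE
BEGIN
    -- Only some categories are selected: filter through the link table.
    SELECT p.*
    FROM   Products p
    WHERE  p.IsEnabled = 1
    AND    EXISTS (
               SELECT 1
               FROM   ProductCategories pc
               JOIN   #selectedCategories sc ON sc.CategoryID = pc.CategoryID
               WHERE  pc.ProductID = p.ProductID
           );
END
```

One behavioral difference to be aware of: the first branch also returns products that have no category assigned, while the second never does.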

When categories are selected, then building a query that joins products, categories and selected categories is fairly easy. Making it scale and perform is a different topic, and is entirely dependent on your data schema and actual pattern of categories selected. A naive approach is like this:

select ...
from Products p
where p.IsEnabled = 1
and exists (
  select 1  
  from ProductCategories pc
  join #selectedCategories sc on sc.CategoryID = pc.CategoryID
  where pc.ProductID = p.ProductID);

The ProductCategories table must have an index on (ProductID, CategoryID) and one on (CategoryID, ProductID) (one of them is the clustered index, the other is nonclustered). This is true for every solution, by the way. This query would work if most categories are always selected and the result contains most products anyway. But if the list of selected categories is restrictive, then it is better to avoid the scan on the potentially large Products table and start from the selected categories:

with distinctProducts as (
select distinct pc.ProductID
from ProductCategories pc
join #selectedCategories sc on pc.CategoryID = sc.CategoryID)
select p.*
from Products p
join distinctProducts dc on p.ProductID = dc.ProductID;
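
The two indexes mentioned earlier could be created along these lines. This is only a sketch: the index names are made up, and it assumes ProductCategories has no clustered index yet (for example, a surrogate id primary key declared NONCLUSTERED) and no duplicate (ProductID, CategoryID) pairs.

```sql
-- Clustered index for "which categories does this product have?" lookups.
-- UNIQUE also guards against assigning the same category to a product twice.
CREATE UNIQUE CLUSTERED INDEX IX_ProductCategories_ProductID_CategoryID
    ON ProductCategories (ProductID, CategoryID);

-- Nonclustered index for the reverse lookup, starting from the selected categories.
CREATE NONCLUSTERED INDEX IX_ProductCategories_CategoryID_ProductID
    ON ProductCategories (CategoryID, ProductID);
```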

Again, the best solution depends largely on the shape of your data. For example, if you have a very skewed category (one category alone covers 99% of products), then the best solution would have to account for this skew.

Remus Rusanu
BTW I assumed from the explanation that the result must match *any* category (at least one). As others have pointed out, it is a different problem if the product has to match *all* categories.
Remus Rusanu
A: 

This gets all products that are at least in all of the desired categories (no less):

select * from product p1 join (
  select p.product_id from product p 
  join ProductCategory pc on pc.product_id = p.product_id
  where pc.category_id in (3,5,8,12)
  group by p.product_id having count(p.product_id) = 4
) p2 on p1.product_id = p2.product_id

4 is the number of categories in the set.
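
If the category IDs arrive in a temp table rather than as literals, the hard-coded 4 can be derived from that table. A sketch, assuming a temp table named #selectedCategories with a category_id column holding distinct IDs (both names are assumptions, not taken from the question):

```sql
select p1.*
from product p1
join (
  select pc.product_id
  from ProductCategory pc
  join #selectedCategories sc on sc.category_id = pc.category_id
  group by pc.product_id
  -- count(distinct ...) keeps the comparison correct even if the link table
  -- contains the same product/category pair more than once
  having count(distinct pc.category_id) = (select count(*) from #selectedCategories)
) p2 on p1.product_id = p2.product_id
```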

This gets all products that are exactly in all of the desired categories (no more, no less):

select * from product p1 join (
  -- count the product's category rows, so the inner query reads ProductCategory
  select pc1.product_id from ProductCategory pc1
  where not exists (
    -- any category of this product outside the desired list disqualifies it
    select * from ProductCategory pc2
    where pc2.product_id = pc1.product_id
    and pc2.category_id not in (3,5,8,12)
  )
  group by pc1.product_id having count(pc1.product_id) = 4
) p2 on p1.product_id = p2.product_id

The double negative can be read as: get all products for which there are no categories that are not in the desired category list.

For the products in any of the desired categories, it's as simple as:

select * from product p1 where exists (
  select * from product p2 
  join ProductCategory pc on pc.product_id = p2.product_id
  where 
    p1.product_id = p2.product_id and
    pc.category_id in (3,5,8,12)
)
Jordão
I am NOT fond of the use of IN when what you really mean is a JOIN. Sure, the optimizer is smart enough to convert it to a JOIN in *most* cases rather than expanding it to an OR clause, but relying on that is, well, an unreliable shortcut, almost a hack.
Emtucifor
@Emtucifor: I agree! Since the inner queries are the important ones for the example, I didn't put much thought on the outer ones. Fixed that now. The last query uses an exists "join" so that it doesn't generate duplicate rows.
Jordão