views:

56

answers:

1

I'm reading the book Programming Collective Intelligence and trying to apply what I've learned against the Northwind database. Though I'm not yet confident in my understanding of the algorithms presented, I'm starting to get an idea of the general concepts.

Using the Northwind database, I'm attempting to show a list of "customers who bought this also bought XYZ" using the following pseudo-logic:

  1. Find other customers who also purchased my item
  2. Find all other items purchased by those customers
  3. Rank items based on purchase count
  4. Return top N items from previous step

I'm working with the query below:

declare 
 @customerid nchar(5),
 @productid int;

set @customerid = 'ALFKI';
set @productid = 59;

-- find other products from customers who
-- also purchased my productid

select top 10
 od.productid, c.categoryname, p.productname, p.unitsonorder, count(od.productid)
from
 [order details] od
  inner join orders o on o.orderid = od.orderid
  inner join products p on p.productid = od.productid
  inner join categories c on c.categoryid = p.categoryid
where
 o.customerid <> @customerid and
 od.productid <> @productid and 
 p.discontinued = 0
group by
 od.productid, c.categoryname, p.productname, p.unitsonorder
order by 5 desc,4 desc

I think my next step would be to break the query up so that I can filter based on recent purchases (as opposed to all historical purchases) and limit customer matching to N customers, rather than ALL customers who purchased my product. Can anyone provide any pointers? Am I headed in the right direction? Should I take a different direction entirely?

At this point, my goal is performance over accuracy, as I know I do not yet have the experience to apply the algorithms to their fullest benefit. I'm merely trying to apply the concept. Once I'm satisfied I understand it well enough, I intend to test this query against a larger database with more realistic customer data.

+1  A: 

This is usually called "market basket" analysis or "affinity grouping" and it is not as simple as it sounds, mostly due to large number of combinations. First thing to consider is what is the minimum number of cases that would result in a positive detection. For example let's say we have 1M baskets and 10 people who bought product A also bought product B in the same basket. Is 10 cases enough to say that "people who bought A also bought B"? Due to this, using "recent" purchases is a bit tricky. The idea is to create a table of product pairs, but if we have N products, number of combinations is N*(N-1), so for a store of 1000 product, we would have 999000 combinations, so algorithm which prunes this is not simple.

One more thing to consider is the order of items and value of each one. For example customers buying bicycles may often buy (add) a "LED light". But, if one puts a LED light in a basket, should the system offer a bicycle in the list?

Considering that you are using SQL server, I would point you towards Analysis Services Data Mining which uses "MS Association Algorithm" for this purpose. By using same data, you can compare your results against a "commercial solution".

Damir Sudarevic