views:

212

answers:

1

I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for either computing item similarity for ranked items, for finding user-user similarity, or for finding recommended items based on the current users' history. I'd like to start off with a non-targeted approach before factoring in the current users' preferences.

Looking at the Amazon.com recommendations white paper, they use the following logic for offline item-item similarity:

For each item in product catalog, I1 
  For each customer C who purchased I1
    For each item I2 purchased by customer C
       Record that a customer purchased I1 and I2
  For each item I2 
    Compute the similarity between I1 and I2

If I understand correctly, by the time we're at "Compute similiarty between I1 and I2", I have a list of items(I2) purchased in conjunction with a single value I1(the outer loop).

How is this calculation performed?

Another idea is that I'm overthinking this and making it more difficult than I need to - Would it be enough to do a top-n query on the count of I2 bought in conjunction with I1?

I also appreciate suggestions on whether or not this approach is a correct one. My product database has about 150k items at any time. Since the bulk of the reading material I've seen shows user-item similarity or even user-user similarity, should I be looking to go that route instead.

I've worked with similarity algorithms in the past but they've always involved a rank or a score. I think the only way this would work would be to build a customer-product matrix scoring 0/1 for not purchased/purchased. Given the purchase history and the item size, this could get really large.

edit: although i listed python as a tag, i'd prefer to keep the logic inside of a db, preferably using Oracle PL/SQL.

+1  A: 

There's a good O'Reilly book on this topic. While the whitepaper might lay the logic out in pseudo-code like that, I don't think that approach would scale very well. The calculations are all probability calculations, so things like Bayes' Theorem get used to say, "Given Person A purchased X, what's the likelihood they purchased Z?" Straightforward looping over the data is working too hard. You have to go through it all for each person.

Tom
I have the book but its examples all consider something that's rated , in the books case, movies and critics (At least on the chapter on similarity). For instance, "Given my ratings, show me other[movies|critics] that I would like." My data is just purchased, I can derive not purchased if I have to.I dont mind using Bayes but I'm not looking for the likelihood that user A will purchase X. I'm more interested in showing that purchasers of A also bought Z. Disclaimer - I might not understand this as well as I think I do with respects to wanting item-item vs user-item.
Neil Kodner
@Neil, use purchase/no-purchase as ratings of 0 and 1 -- not maximal computational efficiency, but conceptually shows how, if you know how to deal w/ratings, then **of course** you know how to deal w/just purchases as well! And it _does_ have to use Bayes (or an approx thereof) to make any sense, otherwise you can't cut off the amounts of items to show on the "also purchased" list and you would end up with millions of items there (==totally useless) with a vast catalog and many users, and too large numbers anyway even for a very modest e-commerce site.
Alex Martelli