I'm trying to decide on the best similarity metric for a product recommendation system using item-based collaborative filtering. This is a shopping basket scenario where ratings are binary-valued: the user has either purchased an item or not, and there is no explicit rating system (e.g., 5 stars).

Step 1 is to compute item-to-item similarity, though I want to look at incorporating more features later on.

Is the Tanimoto coefficient the best way to go for binary values? Or are there other metrics that are appropriate here? Thanks.

+2  A: 

It sounds like ARTMAP would be a good candidate for what you are looking for.

ARTMAP succeeds at taking items with feature vectors and determining how close they are to other items.

For example:
    Features:     1   2   3   4   5
    Item One      0   0   1   1   0
    Item Two      0   1   0   1   0
    Item Three    1   0   1   1   0

Depending on the amount of tolerance, it would cluster Items One and Three together, and so on.

monksy
A: 

"Evaluation of Item-Based Top-N Recommendation Algorithms" by Karypis proposes cosine similarity and an asymmetric probabilistic similarity measure as similarity metrics for item-item CF in binary ratings contexts such as yours. Cosine similarity is commonly used and seems to perform well.
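
For binary purchase data, cosine similarity between two items reduces to the number of shared buyers divided by the geometric mean of the two buyer counts. Here is a minimal sketch of that, assuming each item is stored as a set of user IDs; the `purchases` data and the function name are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def binary_cosine(buyers_a, buyers_b):
    """Cosine similarity of two items represented as sets of buyer IDs.

    With 0/1 vectors the dot product is the overlap size and each
    vector norm is the square root of the set size.
    """
    if not buyers_a or not buyers_b:
        return 0.0  # an item nobody bought has no defined direction
    overlap = len(buyers_a & buyers_b)
    return overlap / sqrt(len(buyers_a) * len(buyers_b))


# Hypothetical purchase data: item -> set of users who bought it.
purchases = {
    "bread": {"u1", "u2", "u3", "u4"},
    "butter": {"u1", "u2", "u3"},
    "socks": {"u4", "u5"},
}

print(binary_cosine(purchases["bread"], purchases["butter"]))  # about 0.866
print(binary_cosine(purchases["butter"], purchases["socks"]))  # 0.0
```

Storing items as sets keeps the computation proportional to the number of purchases rather than the number of users, which tends to matter when the basket matrix is sparse.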

Michael E
A: 

Chiming in on this old thread: if you have "binary" ratings (either the association exists or doesn't, but has no magnitude), then indeed you are forced to look at metrics like the Tanimoto / Jaccard coefficient.
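
For reference, the Tanimoto / Jaccard coefficient of two items is the number of users who bought both divided by the number who bought either. A rough sketch, assuming items are held as sets of user IDs (the names and data below are invented for the example):

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto (Jaccard) coefficient: |A intersect B| / |A union B|."""
    union = len(buyers_a | buyers_b)
    if union == 0:
        return 0.0  # neither item was ever bought
    return len(buyers_a & buyers_b) / union


# Hypothetical baskets: item -> set of users who bought it.
milk = {"alice", "bob", "carol"}
cereal = {"bob", "carol", "dave"}
batteries = {"erin"}

print(tanimoto(milk, cereal))     # 0.5 (2 shared buyers out of 4 total)
print(tanimoto(milk, batteries))  # 0.0
```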

However, I'd suggest a log-likelihood similarity metric is significantly better for situations like this. Here's the code from Mahout: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java?view=markup
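
To give a feel for the idea without reading the Java, here is a small Python sketch of a log-likelihood ratio (G-squared) score computed from the 2x2 co-occurrence table of two items. It is not a port of the Mahout class: the function names, the set-based data layout, and the final squashing of the score into [0, 1) via `1 - 1 / (1 + llr)` are assumptions made for this illustration.

```python
from math import log


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 contingency table.

    k11: users who bought both items
    k12: users who bought A but not B
    k21: users who bought B but not A
    k22: users who bought neither
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing to the sum
            total += k * log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def llr_similarity(buyers_a, buyers_b, n_users):
    """Map the LLR score into [0, 1); the mapping is an assumed choice."""
    k11 = len(buyers_a & buyers_b)
    k12 = len(buyers_a) - k11
    k21 = len(buyers_b) - k11
    k22 = n_users - len(buyers_a) - len(buyers_b) + k11
    llr = max(0.0, log_likelihood_ratio(k11, k12, k21, k22))
    return 1.0 - 1.0 / (1.0 + llr)


# Hypothetical example: 10 users, both items bought by the same 5 users.
same_buyers = {1, 2, 3, 4, 5}
print(log_likelihood_ratio(5, 0, 0, 5))  # 20 * ln(2), about 13.86
print(llr_similarity(same_buyers, same_buyers, n_users=10))
```

Unlike Tanimoto, this score takes the total number of users into account, so an overlap that could easily occur by chance between two very popular items scores low.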

Sean Owen