I'm building a recommendation engine using collaborative filtering. For similarity scores, I use Pearson correlation. This works great most of the time, but sometimes I have users that share only 1 or 2 fields. For example:

User 1{
a: 4
b: 2
}

User 2{
a: 4
b: 3
}

Since this is only 2 data points, a Pearson correlation would always be ±1 (any two points lie on a straight line, i.e. perfect correlation). This obviously isn't what I want, so what value should I use instead? I could just throw away all instances like this (give a correlation of 0), but my data is really sparse right now and I don't want to lose anything. Is there any similarity score I could use that would fit in with the rest of my similarity scores (all Pearson)?
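To illustrate, here is a quick check in plain Python (a minimal sketch; the pearson helper is mine, not from any library):

def pearson(xs, ys):
    # Pearson correlation of two equal-length rating lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) if a user's ratings are all equal

print(pearson([4, 2], [4, 3]))  # 1.0 -- any two distinct points are perfectly correlated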

A: 

I think you should calculate item similarity instead of user similarity, so that you can still recommend items to users who have rated only a few items.
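A minimal sketch of the idea, with hypothetical toy data: invert the user-to-item ratings into item rows, then apply the same Pearson similarity between item rows, which typically overlap on more entries than user rows do.

ratings = {  # user -> {item: rating}; toy data
    'u1': {'a': 4, 'b': 2},
    'u2': {'a': 4, 'b': 3, 'c': 5},
}

by_item = {}  # item -> {user: rating}
for user, prefs in ratings.items():
    for item, r in prefs.items():
        by_item.setdefault(item, {})[user] = r

print(by_item)  # {'a': {'u1': 4, 'u2': 4}, 'b': {'u1': 2, 'u2': 3}, 'c': {'u2': 5}}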

jbochi
+1  A: 

You might want to consider using cosine similarity rather than Pearson correlation. It does not suffer from this problem, and is widely used in the recommender systems literature.
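For instance, a minimal sketch over two raw rating vectors (no centering):

def cosine_similarity(xs, ys):
    # Cosine of the angle between two rating vectors.
    dot = sum(x * y for x, y in zip(xs, ys))
    nx = sum(x * x for x in xs) ** 0.5
    ny = sum(y * y for y in ys) ** 0.5
    return dot / (nx * ny)

print(cosine_similarity([4, 2], [4, 3]))  # ~0.98, not a spurious 1.0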

The canonical solution to this, described by Herlocker et al. in "Empirical Analysis of Design Choices in Neighborhood-based Collaborative Filtering Algorithms", is to "damp" the Pearson correlation to correct for excessively high correlation between users with small co-rating sets. Basically, you multiply the Pearson correlation by the lesser of 1 and cc/50, where cc is the number of items both users have rated. The effect is that, if they have at least 50 items in common, the similarity is raw Pearson; otherwise, it is scaled linearly with the number of rated items they have in common. In your example, with 2 co-rated items, it turns that spurious correlation of 1 into a similarity of 2/50 = 0.04.
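In code, the damping is a one-liner (a sketch; pearson_sim is whatever raw Pearson value you already compute):

def damped_pearson(pearson_sim, n_common, threshold=50):
    # Scale down similarities computed from few co-rated items.
    return pearson_sim * min(1.0, n_common / threshold)

print(damped_pearson(1.0, 2))  # 0.04 for the questioner's two users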

50 may need to be adapted based on your domain and system.

As noted above, cosine similarity does not suffer from this limitation in the same way; for user-user CF, however, Pearson correlation is generally preferred.

Disclaimer: I'm a student in the lab that produced the above-mentioned Herlocker paper.

Michael E
A: 

I know this is old, but I just discovered this thread and see that the question hasn't really been answered.

Yes, Pearson correlation is commonly mentioned in recommender-engine write-ups, and it works reasonably well, but it has some quirks like this. (By the way, the correlation in your example is 1, not 0.)

Cosine similarity is indeed a good alternative. However, if you "center" the data (shift each user's ratings so their mean is 0) before computing it, and there are good reasons you should, then it reduces to exactly the Pearson correlation. So you either end up with the same issues, or, by not centering, with a different set of issues.
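You can check the equivalence numerically (a minimal sketch with made-up ratings; center and cosine are hypothetical helpers):

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def cosine(xs, ys):
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (sum(x * x for x in xs) ** 0.5 * sum(y * y for y in ys) ** 0.5)

u1, u2 = [4, 2, 5, 1], [3, 3, 4, 2]
print(cosine(center(u1), center(u2)))          # identical to Pearson(u1, u2)
print(cosine(center([4, 2]), center([4, 3])))  # 1.0 -- the same 2-point degeneracy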

Consider a Euclidean distance-based similarity metric, where each user's ratings are viewed as a point in space and similarity is inversely related to distance. It doesn't have this sparseness problem, though it does need to be normalized for the number of dimensions; otherwise it penalizes users who co-rate many items, since their distance grows along every shared dimension.
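A minimal sketch (the per-dimension normalization and the 1/(1+d) mapping are one reasonable choice among several):

def euclidean_similarity(xs, ys):
    # Root-mean-square distance: dividing by the number of co-rated
    # items keeps heavy co-raters from looking artificially distant.
    n = len(xs)
    d = (sum((x - y) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5
    return 1.0 / (1.0 + d)  # map distance into (0, 1]

print(euclidean_similarity([4, 2], [4, 3]))  # ~0.59, not a degenerate 1.0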

But really, I'd suggest you look at a log-likelihood-based similarity metric. It also doesn't have these issues, and doesn't even need rating values. This is a great default.
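A sketch of the log-likelihood ratio on co-occurrence counts, following Ted Dunning's formulation (the one Mahout implements); note that it only looks at which items each user rated, not the ratings themselves:

from math import log

def xlogx(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: items both users rated; k12/k21: items only one rated;
    # k22: items neither rated.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Two users with 10 ratings each, sharing 2, in a 1000-item catalog.
# Mahout turns the ratio into a similarity roughly as 1 - 1/(1 + llr).
print(llr(2, 8, 8, 982))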

There are more metrics to consider that wouldn't have this issue: Spearman correlation, or a Tanimoto distance (Jaccard coefficient)-based similarity.
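For example, Tanimoto over the sets of items each user has rated (again ignoring rating values) is just a couple of lines; a sketch:

def tanimoto(items_a, items_b):
    # Co-rated items as a fraction of all items either user rated.
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b)

print(tanimoto({'a', 'b'}, {'a', 'b', 'c'}))  # ~0.67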

Where can you learn more and get an implementation? Voila, Apache Mahout: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/

Sean Owen