Hi everyone.

I'm studying recommendation engines, and I went through the paper that describes how Google News uses collaborative filtering to recommend news items that might be of interest to each user.

One interesting technique that they mention is MinHashing. I read up on what it does, but I only have a fuzzy idea of it and there is a strong chance that I'm wrong. Here is what I could make out of it:

  1. Collect a set of all news items.
  2. Define a hash function for a user. This hash function returns the index, in the list of all news items, of the first item that this user viewed.
  3. Collect, say, "n" such values and represent the user with this list of values.
  4. Based on how much these lists overlap, we can calculate the similarity between users as the number of common items. This greatly reduces the number of comparisons needed.
  5. Based on these similarity measures, group users into different clusters.

This is just what I think it might be. In Step 2, instead of defining a single fixed hash function, it might be possible to vary the hash function so that it returns the index of a different element. So one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element, and so on. Since the hash functions are supposed to satisfy the min-wise independent permutations condition, this sounds like a possible approach to me.

Could anyone please confirm whether my understanding is correct, or does the MinHashing portion of the Google News recommendations work in some other way? I'm new to the internals of recommendation systems. Any help is greatly appreciated.

Thanks!

+1  A: 

I think you're close.

First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first of that person's items under the permutation. Since everyone uses the same permutation, two people have a decent chance of having the same first item.

Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
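
To make that concrete, here is a rough Python sketch of the signature step. It's just my illustration, not code from the paper; the function names and parameters are made up.

    import random

    def make_minhash_permutations(all_items, num_hashes, seed=0):
        # Each "hash function" is one random permutation of all news items,
        # stored as a mapping from item to its rank under that permutation.
        rng = random.Random(seed)
        perms = []
        for _ in range(num_hashes):
            ranks = list(range(len(all_items)))
            rng.shuffle(ranks)
            perms.append(dict(zip(all_items, ranks)))
        return perms

    def minhash_signature(user_items, perms):
        # For each permutation, keep the user's item that comes earliest
        # under that permutation. Two users agree on a given position with
        # probability equal to the Jaccard similarity of their item sets.
        return tuple(min(user_items, key=perm.__getitem__) for perm in perms)

The fraction of positions on which two signatures agree is then an estimate of how similar two users' click histories are:

    perms = make_minhash_permutations(["a", "b", "c", "d", "e"], num_hashes=20)
    s1 = minhash_signature({"a", "b", "c"}, perms)
    s2 = minhash_signature({"b", "c", "d"}, perms)
    similarity = sum(x == y for x, y in zip(s1, s2)) / len(s1)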

People who happen to have the same hash value 2-4 times (that is, the same first element under 2-4 permutations) are put together in a cluster. This is repeated 10-20 times, so that each person ends up in 10-20 clusters. Finally, recommendations are based on the (small number of) other people in those 10-20 clusters. Since all this work is done by hashing, people are sent directly into buckets for their clusters, and large numbers of pairwise comparisons aren't needed.
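
And the bucketing step might look something like the sketch below. Again this is just my own illustration; hashes_per_key and num_repeats stand in for the paper's 2-4 and 10-20.

    from collections import defaultdict

    def cluster_by_minhash(signatures, hashes_per_key=3, num_repeats=10):
        # signatures: user_id -> minhash signature (a tuple) of length at
        # least hashes_per_key * num_repeats. Concatenating a few hash
        # values into one bucket key sends each user straight to a bucket,
        # so no pairwise user-to-user comparisons are needed.
        clusters = defaultdict(set)
        for rep in range(num_repeats):
            start = rep * hashes_per_key
            for user, sig in signatures.items():
                key = (rep, sig[start:start + hashes_per_key])
                clusters[key].add(user)
        return clusters

Each user lands in num_repeats buckets, and the other users sharing those buckets are the candidates used to generate recommendations.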

mathmike
Thanks a lot for answering. It makes sense. Given that we repeat the algorithm a number of times, we can compute the similarity of two users just by comparing their lists of hash values, which gives us a measure of how often the two users agree on a position. One question I still have: how are the buckets defined? In order to put a particular person into a bucket, wouldn't we need to compare this person with other people? Or do they just compare the user with *one* user from a bucket, because all the other users in that bucket are mostly similar? Thanks again.
Siddhant
A bucket is defined as "the set of people who have the same hash values under 2-4 of the random permutations". So no person-to-person comparisons are needed to put people into buckets; like a hash table, each person gets sent straight to one bucket. I haven't read the whole paper, but I assume they then use every user in a shared bucket to help determine recommendations. The whole point of the buckets is to make "every user" small enough.
mathmike