I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.

Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weighs point 2 higher, and that would be the most important factor if I had to use only one of the two when generating 'neighbours'.

One idea I had would be to calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside is that as the number of users goes up, this process could take a very long time. For just 1000 users it needs 1000C2 (0.5 * 1000 * 999 = 499,500) calls to the compatibility function, which could also be very heavy on the server.
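For concreteness, the naive approach would look something like this (a minimal sketch; compatibility and users stand in for my actual function and data):

    from itertools import combinations

    # Naive all-pairs: C(n, 2) = n * (n - 1) / 2 compatibility calls.
    # For 1000 users that is 499,500 calls, growing quadratically with n.
    scores = {(a, b): compatibility(a, b) for a, b in combinations(users, 2)}

    def neighbours(u, k=10):
        # The highest-scoring users against u become u's neighbours.
        others = [v for v in users if v != u]
        return sorted(others, reverse=True,
                      key=lambda v: scores.get((u, v), scores.get((v, u))))[:k]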

So I am looking for any advice, links to articles etc on how best to achieve a system like this.

A: 

This sounds like a classification problem. There are many solutions and approaches to it.

To start exploration check this: http://en.wikipedia.org/wiki/Statistical_classification

popopome
+5  A: 

In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321

Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
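For a flavour of what's in that chapter, here is a minimal sketch of the Pearson-correlation user similarity it describes (assuming ratings are stored as a user -> {item: rating} dict):

    from math import sqrt

    def sim_pearson(prefs, p1, p2):
        # Pearson correlation over the items both users have rated.
        shared = [item for item in prefs[p1] if item in prefs[p2]]
        n = len(shared)
        if n == 0:
            return 0.0
        sum1 = sum(prefs[p1][it] for it in shared)
        sum2 = sum(prefs[p2][it] for it in shared)
        sum1sq = sum(prefs[p1][it] ** 2 for it in shared)
        sum2sq = sum(prefs[p2][it] ** 2 for it in shared)
        psum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)
        num = psum - sum1 * sum2 / n
        den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
        return num / den if den else 0.0

Note how it rewards both of your criteria at once: it only looks at items both users rated, and it scores how similarly they rated them.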

Mark Roddy
I can highly recommend this book. You'll find what you need in there.
Joeri Sebrechts
Also, as for the duration of the calculation process, it will take a long time to run for a large number of users. You'll want to batch this and have the site display the results of the most recent run.
Mark Roddy
A: 

Have you heard of Kohonen networks?

It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although the net is usually displayed as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.

With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).

This reduces your problem to finding the variables that define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together, while death metal and reggae are quite distant (at least in my opinion).

By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closest to the root will be the most important variables for establishing 'closeness'.
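To make this concrete, here is a minimal sketch of a 2-D Kohonen map trained on user rating vectors (numpy only; the grid size, iteration count, and decay schedule are illustrative, not tuned):

    import numpy as np

    def train_som(data, grid=(10, 10), iters=1000, lr0=0.5, sigma0=3.0, seed=0):
        # data: (n_users, n_features) matrix of rating vectors.
        rng = np.random.default_rng(seed)
        h, w = grid
        weights = rng.random((h, w, data.shape[1]))
        # Grid coordinates, used by the neighbourhood function below.
        coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                      indexing="ij"), axis=-1)
        for t in range(iters):
            x = data[rng.integers(len(data))]
            # Best-matching unit: the node whose weights are closest to x.
            bmu = np.unravel_index(
                np.argmin(np.linalg.norm(weights - x, axis=2)), (h, w))
            # Learning rate and neighbourhood radius decay over time.
            lr = lr0 * np.exp(-t / iters)
            sigma = sigma0 * np.exp(-t / iters)
            # Pull the BMU and its grid neighbours towards x.
            grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            influence = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
            weights += lr * influence * (x - weights)
        return weights

    def cell_of(weights, x):
        # Users mapping to the same or adjacent cells are 'neighbours'.
        d = np.linalg.norm(weights - x, axis=2)
        return np.unravel_index(np.argmin(d), d.shape)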

Caerbanog
I personally find Kohonen networks (self-organizing maps, SOMs) very hard to understand intuitively. Do you have good recommendations for implementations and explanations that explicate the math involved?
Gregg Lind
A: 

It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. The neighbourhood may then be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.

You can find a video about clustering in Google's lecture series about cluster computing and MapReduce.
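As a minimal sketch of the idea, using scikit-learn's KMeans (and assuming each user can be represented as a dense vector of item ratings):

    import numpy as np
    from sklearn.cluster import KMeans

    ratings = np.random.rand(1000, 50)  # placeholder: (n_users, n_items)

    # n_clusters controls how large the neighbourhoods are.
    labels = KMeans(n_clusters=20, n_init=10,
                    random_state=0).fit_predict(ratings)

    def neighbours(user_idx):
        # Everyone in the same cluster is the user's neighbourhood.
        return np.flatnonzero(labels == labels[user_idx])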

Ryszard Szopa
A: 

Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.

The graph can be computed statically and then refreshed periodically (e.g. hourly or daily), generating edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
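A minimal sketch of such a batch job (your compatibility function and persistence layer are stand-ins here):

    import heapq

    def rebuild_neighbour_table(users, compatibility, k=10):
        # Run offline (e.g. nightly): precompute the top-k neighbours
        # per user so runtime queries are a simple lookup.
        table = {}
        for u in users:
            # Still O(n^2) compatibility calls overall, but off the
            # request path; O(n log k) selection per user.
            table[u] = heapq.nlargest(k, (v for v in users if v != u),
                                      key=lambda v: compatibility(u, v))
        return table  # persist this; the site just reads table[user]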

+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.

stephbu
+1  A: 

Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go only as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
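A minimal sketch of that last step, once you have a user's neighbors (the user -> {item: rating} structure is an assumption):

    def recommend(user, neighbors, ratings, top_n=10):
        # Suggest items the neighbors rated highly but the user hasn't.
        seen = set(ratings[user])
        scores = {}
        for nb in neighbors:
            for item, r in ratings[nb].items():
                if item not in seen:
                    scores[item] = scores.get(item, 0.0) + r
        return sorted(scores, key=scores.get, reverse=True)[:top_n]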

GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.

The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.

As far as a computationally inexpensive solution, you could try this:

  • Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
  • Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they give those items high ratings frequently. Frequency and strength are probably important for your category aggregate score.
  • When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.

This method wouldn't be as accurate but it's fast.
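A minimal sketch of that roll-up (item_categories is an assumed item -> category lookup):

    from collections import defaultdict

    # category_scores: user -> category -> aggregate score.
    category_scores = defaultdict(lambda: defaultdict(float))

    def record_rating(user_id, item_id, rating):
        # Roll the rating up to the item's category; frequent high
        # ratings in a category raise that user's aggregate for it.
        category_scores[user_id][item_categories[item_id]] += rating

    def category_similarity(a, b):
        # Compare users on shared categories only: a small, fixed-size
        # vector instead of the full ratings history.
        shared = category_scores[a].keys() & category_scores[b].keys()
        return sum(category_scores[a][c] * category_scores[b][c]
                   for c in shared)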

Cheers.

Corbin March
+1  A: 

What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.

One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
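For what it's worth, scikit-learn's SpectralClustering accepts exactly such a precomputed matrix; a minimal sketch (compatibility stands in for your function, and the matrix should be symmetric and non-negative):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # The O(n^2) part: the full pairwise similarity matrix.
    S = np.array([[compatibility(a, b) for b in users] for a in users])

    # Cluster directly on the precomputed similarities.
    labels = SpectralClustering(n_clusters=20,
                                affinity="precomputed").fit_predict(S)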

If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.

The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users, 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
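A minimal sketch of the whole construction (compatibility stands in for your function; the reciprocal trick and random prototypes are as described above):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_users(users, compatibility, n_prototypes=200,
                      n_clusters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Choose prototypes at random, as the dissertation suggests.
        idx = rng.choice(len(users), size=n_prototypes, replace=False)
        prototypes = [users[i] for i in idx]

        def dissimilarity(a, b):
            # Reciprocal turns compatibility into a distance-like
            # measure; the epsilon guards against division by zero.
            return 1.0 / (compatibility(a, b) + 1e-9)

        # Each user becomes a point in n_prototypes-dimensional space;
        # only O(n * p) compatibility calls are needed.
        X = np.array([[dissimilarity(u, p) for p in prototypes]
                      for u in users])
        return KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)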

Dima