I've been looking at building a 'people who like x also like y' style recommendation system. I was considering Vogoo, but after looking through its code it seems to rely heavily on nearest-neighbour calculations based on ratings.

Over the last few weeks I've seen a few articles stating that most people either don't rate at all, or rate a 5: http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html

I don't currently have a ratings system implemented, and I don't really see the need to implement one if the ratings that do come in barely fluctuate.

Does this mean that KNN isn't really valuable?

Does anybody have any recommendations for developing a system to get recommendations of similar likeness based on past viewing history (passive filtering)?

The data I'm working with is event based, so if you've looked at men's doubles tennis, Blue Jays baseball, college women's basketball, etc., I'd recommend other events currently in your area which other users who looked at similar events have also viewed.

I mostly work with PHP, but have started learning Python (and will probably need to learn Java, if that helps).

+2  A: 

Well, the curt answer to your first question would be no. If you have no variation in your data (YouTube stars), it's difficult to make a recommendation.

What I might suggest is trying to expand the amount of data you have. For the YouTube example, instead of just looking at the star ratings, also consider the percentage of the video that was watched. Lots of pausing, seeking, and rewinding might mean that the user liked the video and wanted to rewatch parts of it, so it should get a boost from that.
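As a toy illustration of that idea, an implicit "rating" could be derived from engagement signals instead of explicit stars. The weights below are pure assumptions for the sketch, not tuned values:

```python
def implicit_score(fraction_watched, pauses, seeks, rewinds):
    """Combine viewing signals into a rough 0-to-1 engagement score."""
    # Watching most of the video is the strongest signal.
    score = fraction_watched
    # Pausing/seeking/rewinding suggests the viewer revisited parts,
    # so give each interaction a small (assumed) boost.
    score += 0.02 * (pauses + seeks + rewinds)
    return min(score, 1.0)

print(implicit_score(0.9, 2, 1, 1))   # engaged viewer, near 1.0
print(implicit_score(0.1, 0, 0, 0))   # bounced early, low score
```

Scores like these could then feed any rating-based recommender in place of stars.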

The standard way of doing recommendation, at least in the music world, is to come up with a distance metric that gives you a distance between any two pieces of music. Then, when you find out the type of music a user likes, you can pick songs that are "close" to their tastes according to the metric. The same information is often stored as a similarity matrix, where two items with a large distance have a low similarity.
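As a rough sketch of such a similarity matrix, here is cosine similarity computed over made-up user ratings (cosine is a common choice, though the answer doesn't prescribe a specific metric; the songs and ratings are invented):

```python
import math

# Toy data: item -> {user: rating}
ratings = {
    "song_a": {"u1": 5, "u2": 3, "u3": 4},
    "song_b": {"u1": 4, "u2": 3},
    "song_c": {"u3": 5},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Build the similarity matrix for every unordered pair of items.
sim = {(x, y): cosine(ratings[x], ratings[y])
       for x in ratings for y in ratings if x < y}
```

Recommending for a user then amounts to picking items with high similarity to ones they already like.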

So the question comes down to how you generate these similarities. One way you could do it would be to count how many people that watched show A also watched show B. If you do this for every pair of events, you'll be able to make recommendations from the corpus you've analyzed. Unfortunately, this doesn't extend well to making recommendations for events where you don't already know how many people watched them (live events instead of recorded ones).
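The co-occurrence counting described above can be sketched like this (the viewing logs are made up):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical viewing logs: user -> set of shows watched.
watched = {
    "alice": {"A", "B", "C"},
    "bob":   {"A", "B"},
    "carol": {"B", "C"},
}

# For every pair of shows, count how many users watched both.
co_counts = defaultdict(int)
for shows in watched.values():
    for x, y in combinations(sorted(shows), 2):
        co_counts[(x, y)] += 1

# Two of the three users who watched A also watched B.
print(co_counts[("A", "B")])  # -> 2
```

The counts double as an unnormalized similarity matrix over the analyzed corpus.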

This is at least a start though.

Andrew
Thanks for your great response, Andrew. I had actually forgotten that I had posted this here. I've responded with how I actually ended up 'solving' this, but I think the conclusion is that recommendations based on votes aren't effective. If I had the type of data you recommend (pausing/rewinding, etc.) that would be great; however, though I have a large dataset, it isn't that deep. Great answer though. Your method is basically the same as what I've done, but the answer to 'but events are in the future' is to look at the event type, not the actual event.
pedalpete
A: 

After Andrew's great response, I've decided to explain what I've done in the hope it may help others (though it may be specific to my implementation).

Keep in mind that I've got data on LOTS of events and where those events take place.

The script I used to build recommendations was this one: http://www.codediesel.com/php/item-based-collaborative-filtering-php/

However, without any ratings already in the system, and given the 'questionable' value of user ratings, I created ratings based on the similarities I already had in the data set.

I basically structured it like this:

1) User one goes to men's tennis matches.
2) Get all other users who go to men's tennis matches.
3) For each user who goes to men's tennis matches, find what other sports those users go to.
4) For each of those other sports, count how many users attended their events. I used that count as the score for those sports for the first user.
5) Then, for each user who went to tennis, I built a 'similarity to first user' based on how many other sports they went to, and the score of those sports for the first user.
6) This created a distance score for each user, and I applied that distance score as a score on each of the sports the secondary user went to.
7) All of this was put into an array and passed to the recommendation script linked above.
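The steps above can be sketched roughly like this, with toy attendance data and event types standing in for individual events. The names, weights, and ranking are illustrative assumptions, not the actual PHP implementation:

```python
from collections import defaultdict

# Toy data: user -> set of sport/event types attended.
attendance = {
    "user1": {"mens_tennis"},
    "user2": {"mens_tennis", "blue_jays_baseball"},
    "user3": {"mens_tennis", "womens_basketball", "blue_jays_baseball"},
    "user4": {"womens_basketball"},
}

def recommend(target, attendance):
    target_sports = attendance[target]
    # Steps 2-4: find peers who share a sport with the target, and count
    # attendance of the other sports (the counts act as the "ratings").
    sport_scores = defaultdict(int)
    peers = {}
    for user, sports in attendance.items():
        if user == target:
            continue
        overlap = sports & target_sports
        if not overlap:
            continue
        peers[user] = len(overlap)  # step 5: crude similarity to target
        for sport in sports - target_sports:
            sport_scores[sport] += 1
    # Step 6: weight each peer's other sports by that peer's similarity.
    weighted = defaultdict(float)
    for user, similarity in peers.items():
        for sport in attendance[user] - target_sports:
            weighted[sport] += similarity * sport_scores[sport]
    # Step 7: rank the candidate sports for recommendation.
    return sorted(weighted, key=weighted.get, reverse=True)

print(recommend("user1", attendance))  # -> ['blue_jays_baseball', 'womens_basketball']
```

Because recommendations are keyed to event types rather than individual events, they still work for future events with no viewing history yet.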

This actually worked surprisingly well, better than I had expected given the sample size I was working with.

However, it is painfully slow to run, and I'm not sure how I'll progress from here.

pedalpete