So, I'm just taking a stab in the dark here. I'm really not much of a data miner. I ask out of pure interest, because I really won't have time to take part in this contest.

But just for the fun of it, how would you tackle it?

It works something like this: you get a really large set of movie IDs and user votes. Now, given a few votes by some user and a film, what rating would that user give the film?

EDIT: The URL for said prize is http://www.netflixprize.com/

+1  A: 

Ok, here is my idea:

My statistics classes are a while back. But you could do a linear regression with a mixed model, i.e. with dummy group variables, to find out the individual bias of every user.

So, that would be my first step, having a model like:

movie score by a user = movie score + user bias.

Each user has the same bias across all movies.
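A minimal sketch of that first model, not a real mixed-model regression: it estimates "movie score" as the movie's mean rating and "user bias" as the user's mean deviation from those movie means. The function names `fit_baseline` and `predict`, the `(user, movie, score)` tuple layout, and the fallback value `3.0` are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: iterable of (user, movie, score) tuples (assumed layout)."""
    ratings = list(ratings)

    # Pass 1: "movie score" = mean rating of each movie.
    movie_sum = defaultdict(float)
    movie_cnt = defaultdict(int)
    for _user, movie, score in ratings:
        movie_sum[movie] += score
        movie_cnt[movie] += 1
    movie_mean = {m: movie_sum[m] / movie_cnt[m] for m in movie_sum}

    # Pass 2: "user bias" = mean of (score - movie mean) over the user's votes.
    res_sum = defaultdict(float)
    res_cnt = defaultdict(int)
    for user, movie, score in ratings:
        res_sum[user] += score - movie_mean[movie]
        res_cnt[user] += 1
    user_bias = {u: res_sum[u] / res_cnt[u] for u in res_sum}

    return movie_mean, user_bias


def predict(movie_mean, user_bias, user, movie, default=3.0):
    # Unknown movies fall back to `default`, unknown users to zero bias.
    return movie_mean.get(movie, default) + user_bias.get(user, 0.0)
```

A proper least-squares fit would estimate both sets of parameters jointly; the two-pass averaging here is just the cheapest thing that matches the formula.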

Now, construct a graph like this: every movie is a node, and for every user, add an edge, or raise its weight by one, between all pairs of movies this user likes.

Run Weighted Cluster Editing on the graph to identify clusters of movies. Adjust the definition of "likes" above, to get rather large clusters.
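The graph construction could look roughly like this. It only builds the weighted edge list; the Weighted Cluster Editing step itself is not shown. The name `build_like_graph` and the `like_threshold` parameter (the adjustable definition of "likes") are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_like_graph(ratings, like_threshold=4):
    """Return a Counter mapping (movie_a, movie_b) -> number of users who like both.

    ratings: iterable of (user, movie, score) tuples (assumed layout).
    A user "likes" a movie if score >= like_threshold.
    """
    liked = defaultdict(set)
    for user, movie, score in ratings:
        if score >= like_threshold:
            liked[user].add(movie)

    edges = Counter()
    for movies in liked.values():
        # Sort so each undirected edge has one canonical key.
        for a, b in combinations(sorted(movies), 2):
            edges[(a, b)] += 1
    return edges
```

Lowering `like_threshold` adds more and heavier edges, which is one way to push the clustering toward larger clusters.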

Now, we improve the model:

movie score by a user = movie score + user bias + cluster bias.

And well, with that I would go and predict.
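One possible reading of "cluster bias", sketched under assumptions: for each (user, cluster) pair, take the mean of what is left over after subtracting movie score and user bias. The function name `fit_cluster_bias` and the `cluster_of` mapping (movie to cluster label, as produced by whatever clustering was run) are hypothetical.

```python
from collections import defaultdict


def fit_cluster_bias(ratings, movie_mean, user_bias, cluster_of):
    """Mean residual per (user, cluster) after removing movie score and user bias.

    ratings:    iterable of (user, movie, score) tuples (assumed layout)
    movie_mean: dict movie -> mean score
    user_bias:  dict user -> bias
    cluster_of: dict movie -> cluster label (hypothetical clustering output)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, movie, score in ratings:
        key = (user, cluster_of[movie])
        sums[key] += score - movie_mean[movie] - user_bias[user]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

A prediction would then be movie score + user bias + the bias for that user and the movie's cluster, falling back to zero for pairs never seen in training.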

Edit: Better to run five clusterings. In the first, add edges only for 5-star votes. In the next one, for 4- and 5-star votes. And so forth.

And now the model is:

movie score by a user = movie score + general bias + 5-star bias + 4-5-star bias + ... + 5-4-3-2-1-star bias

Regress and predict!

nes1983
+1  A: 

Obviously I don't have a good enough idea otherwise I would be working on it instead of posting it here :)

Wired has covered the progress of the prize, for instance here. Most teams share their knowledge after a while, so they are all pretty close together, but it seems (as so often) that the last 20% will take 80% of the effort.

I would try to solve the problem of movies like Napoleon Dynamite, which do not fit any of the currently used graphs. Whether you like that movie doesn't seem to have anything to do with your feelings about Superman or Silence of the Lambs, etc. I would think a big enough "training" set would solve this, but such a set isn't feasible. Instead, I would focus on finding a way to cluster these oddball movies and then process them in a different way. It seems to be a type of movie you either love or hate, not one that you think is just OK, so I would not use a non-linear rating algorithm.
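One rough way to spot such love-it-or-hate-it movies is to flag titles whose ratings are unusually spread out. This is only a sketch: the name `polarizing_movies`, the input layout, and the variance cutoff of `2.0` are assumptions, and high variance is a crude stand-in for a genuinely two-peaked rating distribution.

```python
from statistics import pvariance


def polarizing_movies(scores_by_movie, min_variance=2.0):
    """Return the movies whose rating variance is at least min_variance.

    scores_by_movie: dict movie -> list of scores (assumed layout).
    min_variance is an arbitrary illustrative cutoff, not a tuned value.
    """
    return sorted(
        movie
        for movie, scores in scores_by_movie.items()
        if len(scores) > 1 and pvariance(scores) >= min_variance
    )
```

The flagged movies could then be routed to a separate model, which is the "process them in a different way" part.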

olle
Well, using Cluster Editing as I suggested would put the oddballs in a cluster of their own, I guess. Maybe I should try my idea and see how far off the mark I am.
nes1983
A: 

So, maybe for those 3 readers who are NOT completely familiar with linear regression, like me: Netflix demands that contestants improve on its predictions by 10%. That's tough, because I suppose that predicting a user's rating simply from the average rating other users have given is probably a good estimator already. What I want to say is: there is not much room left for improvement.
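To make the 10% concrete: the contest scored predictions by root-mean-square error (RMSE), so a 10% improvement means reaching 0.9 times the reference RMSE. A small sketch, where the helper names `rmse` and `target_rmse` and the toy numbers are made up for illustration:

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def target_rmse(reference_rmse, improvement=0.10):
    """RMSE needed to beat a reference by the given fraction."""
    return reference_rmse * (1.0 - improvement)


# Toy data: always predicting the average rating (3) for two votes of 4 and 2.
baseline = rmse([3, 3], [4, 2])
goal = target_rmse(baseline)
```

Because RMSE is already small for a sensible average-based predictor, shaving off a further tenth leaves little slack.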

nes1983
A: 

You can read how the team that won the $50k progress prize did it here: http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf

I don't understand most of it. Before the competition I would have guessed genetic algorithms would be the best approach, but it looks like they didn't use them.

Brian Armstrong