views:

250

answers:

5

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.

I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.

Here's what I want to accomplish:

  1. Compose a set of samples from each user's likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc.).

  2. A user can get suggestions based on the likes of the other users. For example, if both user A and user B like Spielberg (a connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately: maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't weigh her as heavily as the other attributes).

Basically, compare sets from user A with similar sets from user B, and come up with a rating for user A.
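To make the idea above concrete, here is a minimal sketch of per-attribute weighted scoring. All names and weights are made up for illustration; the interesting part of the real problem is learning those weights from each user's likes.

```python
def score(movie, user_weights):
    """Sum the user's learned weight for each attribute value of the movie."""
    total = 0.0
    for attr, value in movie.items():
        total += user_weights.get((attr, value), 0.0)
    return total

# Hypothetical learned weights for user A: likes westerns and Spielberg,
# dislikes Katie Holmes.
user_a = {("genre", "western"): 2.0,
          ("director", "Spielberg"): 1.5,
          ("actor", "Katie Holmes"): -2.0}

movie = {"genre": "western", "director": "Spielberg"}
print(score(movie, user_a))  # 3.5
```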

I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?

Actually, after some quick research, it seems a Bayesian filter would work. If so, would this be the best approach? Would it be as simple as just "normalizing" the movie data, training a classifier for each user, and then just classifying each movie?

If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.

Thanks!

+2  A: 

There are a few algorithms that are good for this:

ARTMAP: groups items via probability against each other (this isn't fast, but it's the best fit for your problem IMO). ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.

KMeans: separates out the vectors by the distance between them.

PCA: separates the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in computer vision.
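Of the three, KMeans is the easiest to see in a few lines. Below is a tiny pure-Python sketch (1-D points, fixed initial centers, illustrative data only) showing how it alternates between assigning points to the nearest center and moving each center to the mean of its cluster:

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means: assign points to nearest center, then recenter."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
print(centers)  # two centers, near 1.0 and 9.5
```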

monksy
Thanks. Upon reading about ARTMAP, it seems like a good candidate. Since I understand code better than scientific papers, I found this http://users.visualserver.org/xhudik/art/doc/index.html and spawned this http://stackoverflow.com/questions/1609296/artmap-adaptive-resonance-theory-implementatios-basics ... KMeans looks interesting too, but one at a time :)
Ivan
This book has a really easy intro tutorial on it: http://www.amazon.com/AI-Application-Programming-Tim-Jones/dp/1584502789 However, IIRC the probability function has an error in it.
monksy
+2  A: 

The K-nearest neighbor algorithm may be right up your alley.

hythlodayr
That one looks simple enough for me to try and implement myself. A good learning resource, thanks!
Ivan
+4  A: 

This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender so that it will output a possible rating for unseen movies. So the input data looks like:

user movie   year   genre   ...    | rating
---------------------------------------------
  1    1     2006   action         |    5
  3    2     2008   drama          |    3.5
  ...

and for an unrated movie X:

10    20     2009   drama   ?

we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.

The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction. This approach has the advantage of being very simple and easy to implement from scratch.
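A hedged sketch of that nearest-neighbor idea, using a deliberately crude distance (the number of attributes on which two movies differ; a real system would weight attributes, as discussed in the question):

```python
def distance(a, b):
    """Count the attributes on which two movies differ (toy metric)."""
    return sum(1 for attr in a if a[attr] != b.get(attr))

def predict_rating(unrated, rated, k=2):
    """Average the ratings of the k rated movies closest to `unrated`."""
    nearest = sorted(rated, key=lambda m: distance(unrated, m["attrs"]))[:k]
    return sum(m["rating"] for m in nearest) / k

rated = [
    {"attrs": {"year": 2006, "genre": "action"}, "rating": 5.0},
    {"attrs": {"year": 2008, "genre": "drama"},  "rating": 3.5},
    {"attrs": {"year": 2007, "genre": "drama"},  "rating": 4.0},
]
print(predict_rating({"year": 2008, "genre": "drama"}, rated, k=2))  # 3.75
```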

Other, more sophisticated approaches exist. For example, you can build a decision tree, fitting a set of rules to the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others... Going through each of these won't be easy for someone without the proper background, so I expect you would be using an external tool/library. Now, you seem to be familiar with Bayesian filtering, so a simple naive Bayes model could in fact be very powerful. One advantage is that it allows for prediction under missing data.
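For a feel of the naive Bayes option: it picks the class maximizing P(class) times the product of P(attribute | class), with counts estimated from the training rows. A toy sketch (illustrative data; Laplace smoothing keeps unseen attribute values from zeroing out the product):

```python
from collections import Counter, defaultdict

def train(rows):
    """Count class priors and per-class attribute-value frequencies."""
    prior = Counter(r["rating"] for r in rows)
    cond = defaultdict(Counter)  # (rating, attr) -> counts of values
    for r in rows:
        for attr, val in r["attrs"].items():
            cond[(r["rating"], attr)][val] += 1
    return prior, cond

def classify(attrs, prior, cond, total):
    """Return the rating class with the highest naive Bayes score."""
    best, best_p = None, -1.0
    for rating, n in prior.items():
        p = n / total
        for attr, val in attrs.items():
            # add-one (Laplace) smoothing
            p *= (cond[(rating, attr)][val] + 1) / (n + 2)
        if p > best_p:
            best, best_p = rating, p
    return best

rows = [
    {"attrs": {"genre": "action", "director": "Nolan"}, "rating": "like"},
    {"attrs": {"genre": "action", "director": "Bay"},   "rating": "like"},
    {"attrs": {"genre": "drama",  "director": "Bay"},   "rating": "dislike"},
]
prior, cond = train(rows)
print(classify({"genre": "action", "director": "Nolan"}, prior, cond, len(rows)))
```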

The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.

If you want to play around with different algorithms in a simple, intuitive package that requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be to prepare the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!)
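For reference, Weka's native format is ARFF: a header declaring each attribute, followed by comma-separated data rows. A minimal hand-written example for a movie-rating dataset like the table above (attribute names and values are illustrative) might look like:

```
@relation movie_ratings

@attribute year   numeric
@attribute genre  {action,drama,comedy}
@attribute rating numeric

@data
2006,action,5
2008,drama,3.5
```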

I guess for someone not looking to go into too much detail, I would recommend going with the nearest neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.

Amro
Awesome answer, thanks. I'm going to dive into Weka this weekend and see what I come up with.
Ivan
+1  A: 

Check out some of the work of the top teams for the Netflix Prize.

jilles de wit