I have a large sparse matrix representing attributes for millions of entities. For example, one record, representing an entity, might have attributes "has(fur)", "has(tail)", "makesSound(meow)", and "is(cat)".

However, this data is incomplete. For example, another entity might have all the attributes of a typical "is(cat)" entity, but it might be missing the "is(cat)" attribute. In this case, I want to determine the probability that this entity should have the "is(cat)" attribute.

So the problem I'm trying to solve is determining which missing attributes each entity should contain. Given an arbitrary record, I want to find the top N most likely attributes that are missing but should be included. I'm not sure what the formal name is for this type of problem, so I'm unsure what to search for when researching current solutions. Is there a scalable solution for this type of problem?
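For concreteness, here's a minimal sketch of the kind of sparse binary entity-by-attribute matrix I mean (SciPy, toy data; all names are illustrative):

```python
from scipy.sparse import csr_matrix

# Map attribute strings to column indices.
attrs = ["has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"]
col = {a: i for i, a in enumerate(attrs)}

# Two entities: a complete cat record, and one missing "is(cat)".
entities = [
    ["has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"],
    ["has(fur)", "has(tail)", "makesSound(meow)"],
]
rows, cols = zip(*[(i, col[a]) for i, e in enumerate(entities) for a in e])
X = csr_matrix(([1] * len(rows), (rows, cols)), shape=(len(entities), len(attrs)))

# The task: for each row, rank the zero cells by how likely they should be 1.
print(X.toarray())
```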

My first idea is to simply calculate the conditional probability for each missing attribute (e.g. P(is(cat) | has(fur) and has(tail) and ...)), but that seems like a very slow approach. Plus, as I understand the traditional calculation of conditional probability, I imagine I'd run into problems where my entity contains a few unusual attributes that other is(cat) entities don't share, causing the conditional probability to be zero.
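To sketch what I mean (toy data, not a library API): if I understand correctly, a naive-Bayes-style independence assumption with Laplace smoothing would avoid the exact-zero problem, instead of counting the full conjunction directly:

```python
from collections import Counter, defaultdict
import math

# Toy corpus: each entity is a set of attribute strings.
entities = [
    {"has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"},
    {"has(fur)", "has(tail)", "makesSound(woof)", "is(dog)"},
    {"has(fur)", "makesSound(meow)", "is(cat)"},
]

n = len(entities)
attr_count = Counter(a for e in entities for a in e)   # count(a)
pair_count = defaultdict(Counter)                      # count(a and b)
for e in entities:
    for a in e:
        for b in e:
            if a != b:
                pair_count[a][b] += 1

def log_score(target, evidence, alpha=1.0):
    """log P(target) + sum_i log P(evidence_i | target), Laplace-smoothed."""
    score = math.log(attr_count[target] / n)
    for a in evidence:
        # Smoothing keeps rare or unseen co-occurrences from zeroing the product.
        p = (pair_count[target][a] + alpha) / (attr_count[target] + 2 * alpha)
        score += math.log(p)
    return score

evidence = {"has(fur)", "has(tail)", "makesSound(meow)"}
print(log_score("is(cat)", evidence))
print(log_score("is(dog)", evidence))
```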

My second idea is to train a Maximum Entropy classifier for each attribute, and then evaluate it using the entity's current attributes as features. I think the probability calculation would be much more flexible, but this would still have scalability problems, since I'd have to train separate classifiers for potentially millions of attributes. In addition, if I wanted to find the top N most likely attributes to include, I'd still have to evaluate all of the classifiers, which would likely take forever.
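Roughly what I have in mind, sketched with scikit-learn's LogisticRegression standing in for a MaxEnt classifier (logistic regression is its binary form); the toy matrix is hypothetical, and the per-attribute loop is exactly where I expect it to blow up:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Toy sparse binary entity-by-attribute matrix (random placeholder data).
rng = np.random.default_rng(0)
X = csr_matrix(rng.integers(0, 2, size=(100, 20)))

models = {}
for j in range(X.shape[1]):                  # one classifier per attribute...
    y = X[:, j].toarray().ravel()            # target: does the entity have attribute j?
    if y.min() == y.max():
        continue                             # attribute never varies; nothing to learn
    features = X.tolil()
    features[:, j] = 0                       # hide the target attribute from its own features
    models[j] = LogisticRegression(max_iter=1000).fit(features.tocsr(), y)

# Scoring one entity means evaluating every classifier: O(#attributes) per query.
probe = X[0].toarray()
scores = {}
for j, model in models.items():
    feats = probe.copy()
    feats[0, j] = 0                          # same masking at prediction time
    scores[j] = model.predict_proba(feats)[0, 1]

top5 = sorted(scores, key=scores.get, reverse=True)[:5]
print(top5)
```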

Are there better solutions?

+1  A: 

If you have a large data set and you're worried about scalability, then I would look into Apache Mahout. Mahout is a machine learning and data mining library that might help you with your project; in particular, it has some of the most well-known algorithms already built in:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance Java collections (previously Colt collections)
Lirik
Thanks, I've heard of Mahout. It looks interesting, although I'm not familiar with all the implemented algorithms. Can you recommend the ones that are most applicable to my problem?
Chris S
The Naive Bayes classifier might be pretty useful, and so might K-Means, SVD, etc. (different algorithms have different benefits). You might also try [Ensemble Learning](http://en.wikipedia.org/wiki/Ensemble_learning), which combines multiple machine learning algorithms to get a better result. The winners of the Netflix challenge got their best results by combining multiple algorithms, so if you don't want to develop your own algorithms from scratch and you want to combine a lot of algorithms, I'd recommend really taking a look at Mahout.
Lirik
Following StompChicken's advice, the recommendation problem is typically solved by collaborative filtering, which is implemented by Mahout's "Taste" component (https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation).
Chris S
@Chris, I think Mahout has a bit of a learning and integration overhead, but it would still be much faster than writing your own algorithms from scratch, which would take a lot more time. Neither choice is wrong; you just have to make the best one given your circumstances.
Lirik
+1  A: 

This sounds like a typical recommendation problem. Read each attribute as a 'movie rating' and each row as a 'person': for each person, you want to find the movies that they would probably like but haven't rated yet.

You should look at some of the more successful approaches to the Netflix Challenge. The dataset is pretty large, so efficiency is a high priority. A good place to start might be the paper 'Matrix Factorization Techniques for Recommender Systems'.
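To make that concrete, here's a rough sketch of the kind of SGD-trained factorization that paper describes, in plain NumPy (toy sizes; the hyperparameters are placeholders, and a production version would handle implicit feedback as the paper discusses):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed (entity, attribute, value) triples from the sparse binary matrix;
# a few absent pairs are included as explicit 0s for contrast (toy data).
observed = [(0, 0, 1.0), (0, 1, 1.0), (0, 2, 1.0),
            (1, 0, 1.0), (1, 1, 1.0),
            (0, 3, 0.0), (1, 3, 0.0), (2, 3, 1.0), (2, 0, 0.0)]
n_entities, n_attrs, k = 3, 4, 2                 # k = number of latent factors

P = 0.1 * rng.standard_normal((n_entities, k))   # entity factors
Q = 0.1 * rng.standard_normal((n_attrs, k))      # attribute factors

lr, reg = 0.05, 0.02                             # learning rate, L2 regularization
for epoch in range(500):
    for i, j, v in observed:
        err = v - P[i] @ Q[j]                    # prediction error on one observed cell
        # Simultaneous SGD update of both factor vectors (RHS uses old values).
        P[i], Q[j] = (P[i] + lr * (err * Q[j] - reg * P[i]),
                      Q[j] + lr * (err * P[i] - reg * Q[j]))

# Every missing cell now gets a score; rank per entity to get the top N attributes.
print(np.round(P @ Q.T, 2))
```

With binary presence/absence data like yours, treating unobserved cells as weak negatives (the implicit-feedback variant) is the usual refinement over picking explicit 0s by hand.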

StompChicken
Great re-phrasing of my problem.
Chris S