views: 228

answers: 6

Hello,

I have a ton of short stories about 500 words long and I want to categorize them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but eventually I want to use machine learning to guess the categories. What's the best way to approach this? Is there a standard machine learning approach I should be using? I don't think a decision tree would work well, since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!

+1  A: 

Unless there is a chance that you will want to do another 500 classifications in the future, I am not sure I would go for a machine learning approach.

Unless the categories are very similar ("food" and "Italian food", for example), I think a fairly naive heuristic could work very well.

For each category, build a table of common words (for food: "potato", "food", "cook", "tomato", "restaurant", ...) and for each text count which category gets the most word matches. Instead of building the dictionary by hand, you could take a sample of the texts (say, 100), categorize them by hand, let an algorithm pick out the words, and then remove words that are common to all sets (since they provide no information). This is, in essence, a very simple "learning" system.
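The heuristic above can be sketched in a few lines of Python. The keyword tables here are placeholder examples; in practice you would build them from your hand-labeled sample as described:

```python
from collections import Counter

# Hypothetical keyword tables -- in practice, derive these from a
# hand-labeled sample and drop words shared across all categories.
KEYWORDS = {
    "food": {"potato", "food", "cook", "tomato", "restaurant"},
    "music": {"song", "album", "guitar", "band", "concert"},
}

def classify_by_keywords(text):
    """Return the category whose keyword table matches the most words."""
    words = text.lower().split()
    scores = {cat: sum(w in vocab for w in words)
              for cat, vocab in KEYWORDS.items()}
    return max(scores, key=scores.get)
```

For example, `classify_by_keywords("We went to a restaurant to cook potato soup")` scores 3 matches for "food" and 0 for "music", so it returns "food".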

If you really want a machine learning system, there are a number of classification methods. The downside is that although most methods are quite simple to implement, the hard part is choosing a good method, the right features, and good parameters.

kigurai
This is a good point. Machine learning might not be the best approach for this. Regexps all the way!
Jasie
+1  A: 

Try Weka... it's a free data mining tool that implements a lot of machine learning algorithms. It has a GUI and an API, so you can use it directly on your data set or program against it.

If you like the results from the various machine learning algorithms and you're still interested in implementing your own algorithms, then you can implement the one(s) that you like the most. This will also help you remove some of the "will it actually work" feeling that you normally get before you build an ML/AI algorithm.

Lirik
Thanks, this is a good idea. I've used Weka before but didn't quite understand the backend; perhaps I can dig deeper.
Jasie
+2  A: 

I think the paper "Machine learning in automated text categorization" (you can Google and download the PDF) is worth reading. The paper discusses two crucial parts: feature selection (translating text into a feature space) and building a classifier over that feature space. There are many feature selection methods and several classification methods (decision trees, naive Bayes, kNN, SVM, etc.); you can try some combinations to see what works on your data set.

I did something similar before: I used Python for text manipulation, feature selection, and feature weighting, and Orange for the classifier. Orange and Weka already include naive Bayes, kNN, and so on, but nowadays I might write the classifier in a Python script directly; it shouldn't be very hard either.

Hope this helps.

sunqiang
Thanks for the link, the discussion was interesting.
Jasie
+6  A: 

A naive Bayes classifier will most probably work for you. The method goes like this:

  • Fix a number of categories and get a training data set of (document, category) pairs.
  • A data vector for your document will be something like a bag of words. E.g. take the 100 most common words, excluding stop words like "the" and "and". Each word gets a fixed component of your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word appears in the corresponding document.

Training:

  • For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
  • Calculate the probability of a feature given a class: p(F|C) = number of documents of that class containing the feature (e.g. the word "food" appears in the text) / number of documents in that class.

Decision:

  • Given an unclassified document, the probability of it belonging to class C is proportional to P(C|F1, ..., F500) = P(C) * P(F1|C) * P(F2|C) * ... * P(F500|C). Pick the C that maximizes this term.
  • Since multiplication is numerically difficult, you can use the sum of the logs instead, which is maximized at the same C: log P(C|F1, ..., F500) = log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(F500|C).
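The training and decision steps above can be sketched in Python. This is a minimal version that scores only the words present in a document (a common simplification) and uses add-one smoothing so an unseen word doesn't zero out a class; both choices go beyond what the answer specifies:

```python
import math

def train(docs):
    """docs: list of (word_set, category) pairs from the hand-labeled sample."""
    total = len(docs)
    class_counts = {}
    for _, c in docs:
        class_counts[c] = class_counts.get(c, 0) + 1
    vocab = set().union(*(ws for ws, _ in docs))
    # log p(C) = log(number of documents of class C / total documents)
    priors = {c: math.log(n / total) for c, n in class_counts.items()}
    # log p(F|C) with add-one smoothing
    likelihoods = {}
    for c, n in class_counts.items():
        for w in vocab:
            with_w = sum(1 for ws, cc in docs if cc == c and w in ws)
            likelihoods[(w, c)] = math.log((with_w + 1) / (n + 2))
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Pick the class C maximizing log P(C) + sum of log P(F|C)."""
    scores = {}
    for c, log_prior in priors.items():
        scores[c] = log_prior + sum(likelihoods[(w, c)]
                                    for w in vocab & set(words))
    return max(scores, key=scores.get)
```

Working in log space, as the answer suggests, avoids the underflow you would get from multiplying 500 small probabilities.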
bayer
Cool, thanks for the explanation. I read something similar to this in Raghavan, Schütze, and Manning's book: http://nlp.stanford.edu/IR-book/information-retrieval-book.html, and it makes sense.
Jasie
+2  A: 

I've classified tens of thousands of short texts. What I did initially was use a tf-idf vector space model and then do k-means clustering on those vectors. This is a very good first step of exploratory data analysis to get a handle on your dataset. The package I used for clustering was CLUTO: http://glaros.dtc.umn.edu/gkhome/views/cluto/

To compute tf-idf, I just wrote a quick Perl script to tokenize on non-alphanumerics. Every document then consists of a bag of words and is represented as a vector of the words it contains. The value at each index of the vector is the term frequency (tf) * inverse document frequency (idf): the count of that word in the document multiplied by the reciprocal of the fraction of documents that contain that word. (A word like "the" appears everywhere and is therefore very uninformative.)
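A sketch of that weighting in Python (the answer used Perl, and log-scaled idf is the more common variant, but this follows the reciprocal-fraction formula described above):

```python
import re
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of raw text strings; returns one {term: weight} dict per doc."""
    # Tokenize on non-alphanumerics, as described above.
    tokenized = [re.findall(r"[A-Za-z0-9]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # tf * reciprocal of the fraction of documents containing the term.
        vectors.append({t: count * (n / df[t]) for t, count in tf.items()})
    return vectors
```

A word that appears in every document, like "the", gets idf = 1 and is weighted down relative to rarer, more informative terms.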

This method will quickly get you about 80%-90% accuracy. You can then manually label the ones that are right (or more importantly: wrong) and then do supervised learning if you so choose.

This is cool, thanks for the programmatic explanation, I think I could easily port this into my favorite language.
Jasie
A: 

If you're looking for something off the shelf, you might want to try Microsoft's data mining algorithms in SQL Server:

http://msdn.microsoft.com/en-us/library/ms175595%28v=SQL.100%29.aspx

http://www.sqlserverdatamining.com

Shane Cusson