views:

222

answers:

3

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.

I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).

My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the classifier's accuracy seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).

Should I just train a classifier on 5 categories at a time, and run all 150k documents through each classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier in case the document doesn't fit into any of the categories?

Here is my test class http://gist.github.com/451880

+1  A: 

Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?

You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
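The stratified sampling step can be sketched like this (a minimal illustration; the function and the `held_out` data structure are hypothetical, not from the original post):

```python
import random

def none_of_the_above_sample(held_out, n_docs=2000):
    """Build a 'none of the above' training set by stratified sampling:
    draw roughly n_docs / len(held_out) documents from each category
    that is *not* being trained this round.

    held_out: dict mapping category name -> list of documents.
    """
    per_cat = max(1, n_docs // len(held_out))
    sample = []
    for cat, docs in held_out.items():
        # take at most per_cat docs from each held-out category
        sample.extend(random.sample(docs, min(per_cat, len(docs))))
    return sample[:n_docs]
```

With 146 held-out categories and a 2000-document budget, this draws 13 docs from each category, matching the "about 13-14 from each" estimate above.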

Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that defines your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)

Alex Martelli
We have a special category of documents for which we know we can't classify correctly. It's a bit of a kludge but works quite well.
ephes
+4  A: 

How big (number of words) are your documents? Memory consumption with 150K training docs should not be an issue.

Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear Support Vector Machines perform much better.

Is your problem multiclass (a document belongs to exactly one category) or multilabel (a document belongs to one or more categories)?

Accuracy is a poor way to judge classifier performance. You should rather use precision vs. recall, the precision/recall break-even point (PRBP), F1, or AUC, and look at the precision vs. recall curve, where recall (x) is plotted against precision (y) based on the value of your confidence threshold (whether a document belongs to a category or not).

Usually you would build one binary classifier per category (positive training examples of one category vs. all other training examples which don't belong to the current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single per-category measures into a global performance measure, you'll have to micro-average (sum up all true positives, false positives, false negatives, and true negatives, and compute combined scores) or macro-average (compute the score per category, then average those scores over all categories).
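For concreteness, micro- vs. macro-averaged F1 computed from per-category confusion counts might look like this (a plain-Python sketch, names hypothetical):

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when a denominator is empty."""
    p = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    r = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    return 2 * p * r / (p + r) if (p + r) else 0.0

def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    # macro: score each category, then average the scores
    macro = sum(f1_score(*c) for c in counts) / len(counts)
    # micro: pool the raw counts, then score once
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1_score(tp, fp, fn)
    return micro, macro
```

Micro-averaging weights categories by their size; macro-averaging treats every category equally, so rare categories pull the score down if the classifier does poorly on them.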

We have a corpus of tens of millions of documents, millions of training examples, and thousands of categories (multilabel). Since we face serious training-time problems (the number of documents that are new, updated, or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, one of the Python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.

ephes
Average document is about 500-1000 words. The documents can be "multilabel".
erikcw
Ok, then go for the sparse tf-idf vectors suggested by @ogrisel (I forgot to mention that) and one binary classifier per category. Maybe you have some non-ordinal (numerical) features in your documents - you'll have to bin them appropriately.
ephes
which modified version of liblinear did you use? or what did you modify yourselves?
bene
+1 for suggesting precision/recall as measure of classifier quality
digitalarbeiter
Definitely recall / precision / F-measure for measuring performance. Fairly standard in the informatics field (http://en.wikipedia.org/wiki/F-measure). Also recommend using k-fold cross-validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#K-fold_cross-validation) to do the measurement. I also agree that your performance will be better doing binary classification (either it's X or it's not) than trying to label everything in one shot.
Thien
+6  A: 

You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.
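A minimal dict-based sketch of that weighting (the exact IDF variant, tf * log(1 + N/df), is my reading of "TF-log(1 + IDF)" and is an assumption):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict per document
    mapping term -> tf * log(1 + N/df)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())    # normalize counts to frequencies
        vecs.append({t: (c / total) * math.log(1 + n / df[t])
                     for t, c in counts.items()})
    return vecs
```

Rare terms get a larger IDF factor, so they dominate the vector; terms that appear in every document contribute a constant log(2) factor instead of vanishing entirely under this variant.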

Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict.
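One caveat worth noting: modern Python randomizes str hashes per process (PYTHONHASHSEED), so if the hashed indices must stay stable across runs, a deterministic hash is safer. A small sketch (function name and dimensionality are made up):

```python
import hashlib

def term_index(term, dim=2**20):
    """Map a term to a stable non-negative integer column index.
    Uses md5 rather than built-in hash() so that indices are
    reproducible across processes and runs."""
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % dim
```

Collisions are possible but rare for a large enough dim, and in practice they cost little accuracy (this is the usual "hashing trick" trade-off).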

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for each new document to label, you can compute the cosine similarity between the document vector and each category vector, and choose the most similar category as the label for your document.
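The centroid-plus-cosine step, sketched over sparse dict vectors (names are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average a list of sparse dict vectors into one category centroid."""
    c = {}
    for vec in vectors:
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}

def classify(doc_vec, centroids):
    """Pick the category whose centroid is most cosine-similar."""
    return max(centroids, key=lambda cat: cosine(doc_vec, centroids[cat]))
```

A "none of the above" option falls out naturally here: if the best cosine similarity is below some threshold, refuse to assign any category.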

If this is not good enough, then you should try to train a logistic regression model using an L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1+IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a scikits.learn.metrics module with routines to compute those scores for a given model and given dataset.
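To illustrate the underlying idea only (this is a toy SGD trainer with L1 soft-thresholding, not the liblinear solver or the scikit-learn API the answer refers to; all names are made up):

```python
import math
import random

def train_logreg_l1(data, dim, epochs=50, lr=0.1, l1=0.001):
    """Toy L1-regularized logistic regression trained by SGD on sparse
    dict features. data: list of (features, label) with label in {0, 1}."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            z = sum(w[i] * v for i, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))     # predicted probability
            g = p - y                          # gradient of the log-loss
            for i, v in x.items():
                w[i] -= lr * g * v
                # soft-threshold toward zero: the L1 penalty drives
                # uninformative weights exactly to 0 (sparse model)
                w[i] = math.copysign(max(abs(w[i]) - lr * l1, 0.0), w[i])
    return w

def predict(w, x):
    """Binary decision from the sign of the linear score."""
    return 1 if sum(w[i] * v for i, v in x.items()) > 0 else 0
```

The L1 soft-threshold is what produces the sparse weight vectors that make this approach scale to large vocabularies; a production setup would use liblinear or scikit-learn instead.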

For larger datasets: you should try vowpal wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but there are no easy-to-use Python wrappers AFAIK).

ogrisel
Vowpal wabbit is fast. But we still use batch training instead of an online learning algorithm, because liblinear (properly optimized) takes only minutes for millions of documents (we mmapped (shared) the feature vectors so that new train or classify processes don't have to parse a file but only loop over main memory) and it performs better (I don't have the numbers right now...).
ephes
Agreed, vowpal wabbit is really interesting when the stream of data is infinite and does not fit in memory anymore e.g. when coming from the "report spam" button of a popular webmail provider :)
ogrisel
Besides... centroid classification is not much better than Naive Bayes. This paper http://www2009.org/proceedings/pdf/p201.pdf is wrong. We told them that they used test data for training (due to a bug), but discussion went nowhere... linear SVMs are still state of the art.
ephes
Is there a convenient method to handle sparse vectors in numpy/scipy? I mean without converting to some lil/csc matrix - many algorithms won't take a matrix... maybe I'm just stupid, but I haven't found anything behaving like a ((dim, value), ...) tuple in scipy...
ephes
@OP See http://stackoverflow.com/questions/2380394/simple-implementation-of-n-gram-tf-idf-and-cosine-similarity-in-python for an implementation of what poster is talking about using NLTK and PyLucene. Also, I think SVMs are serious overkill for this.
gilesc
In the case of scikit-learn the input is a 2D matrix, so scipy.sparse is a perfect match for use with scikits.learn (although I am not 100% sure Fabian did implement efficient mapping at the C wrapper level for all scipy.sparse representations). Also, vectors are matrices with shape (1, N), so no problem here:

>>> sparse.csc_matrix(np.arange(10))
<1x10 sparse matrix of type '<type 'numpy.int32'>'
        with 9 stored elements in Compressed Sparse Column format>

I haven't checked the scipy source code to see if the 1D case is implemented as a special case.
ogrisel
@gilesc linear SVMs are not overkill; they are the best for this (along with logistic regression with L1 and/or L2 penalty): they are all linear models (hence efficiently implemented with sparse data structures) and can be trained trivially in an online fashion by solving the primal problem with stochastic gradient descent, or with smarter solvers as in liblinear. When choosing between a linear SVM and logistic regression, it's just the loss function and the regularizer that change. They both estimate the decision directly (MAP), hence do not suffer from the Naive Bayes bias.
ogrisel
@ogrisel [1d sparse vectors] hmm, ok - i'll try it - thanks a lot :)
ephes