I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3- and 4-grams in my data have very high IDF even though their component unigrams and bigrams have very low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
I need to come up with a function that takes the document frequencies of an n-gram and of all its component (n-k)-grams, and returns a more meaningful measure of how much the phrase distinguishes its parent document from the rest.
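For concreteness, this is roughly the setup I have at the moment (just a minimal sketch; docs as a list of already-tokenised documents and the 4-gram cap are stand-ins for my actual pipeline):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Yield every n-gram of a token list as a tuple."""
        return zip(*(tokens[i:] for i in range(n)))

    def document_frequencies(docs, max_n=4):
        """Count, for every 1..max_n-gram, how many documents contain it."""
        df = Counter()
        for tokens in docs:
            seen = set()
            for n in range(1, max_n + 1):
                seen.update(ngrams(tokens, n))
            df.update(seen)  # each gram counted at most once per document
        return df

    def idf(gram, df, num_docs):
        """Plain IDF; grams with df close to 1 get the highest scores."""
        return math.log(num_docs / df[gram]) if df[gram] else 0.0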
If I were dealing with probabilities, I would try interpolation or backoff models. But I am not sure what assumptions or intuitions those models rely on to perform well, so I don't know how well they would carry over to IDF scores.
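The most naive thing I can think of is a plain linear interpolation between the n-gram's own IDF and the mean IDF of its component unigrams, reusing the df/idf helpers from the sketch above; but the weight lam below is completely arbitrary, and that is exactly the part I don't know how to choose in a principled way:

    def interpolated_idf(gram, df, num_docs, lam=0.5):
        """Blend the n-gram's own IDF with the mean IDF of its component
        unigrams; lam is an arbitrary, unjustified interpolation weight."""
        own = idf(gram, df, num_docs)
        unigrams = [idf((w,), df, num_docs) for w in gram]
        return lam * own + (1 - lam) * sum(unigrams) / len(unigrams)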
Does anybody have any better ideas?
Answer:
I take it that "you've never tried" is a phrase that you don't want to extract, but which has a high IDF. The problem is that there will be a vast number of n-grams that occur in only one document and therefore get the largest possible IDF score.
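With the usual definition idf(t) = log(N / df(t)) over N documents, any n-gram with df(t) = 1 gets the maximum possible score log(N), however uninteresting the phrase is, and long n-grams are exactly the ones most likely to occur in only a single document.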
There are lots of smoothing techniques in NLP. This paper [Chen & Goodman] is a pretty good summary of many of them. In particular, you sound like you might be interested in Kneser-Ney smoothing, which works in the way you suggest (backing off to lower-order n-grams).
These methods are usually used for language modelling, i.e. to estimate the probability of an n-gram occurring given a very large corpus of the language. I don't really know how you might integrate them with IDF scores, or even whether that is really what you want to do.
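That said, if you want to stay with plain IDF rather than a full language model, one crude backoff-flavoured heuristic (this is not Kneser-Ney, just an illustration of the shape of the idea, with an arbitrary min_df threshold and reusing the df/idf helpers from your question) would be to trust a long n-gram's own IDF only when it occurs in enough documents, and otherwise fall back to its least surprising (n-1)-gram:

    def backoff_idf(gram, df, num_docs, min_df=5):
        """Trust the n-gram's own IDF only if it appears in at least min_df
        documents; otherwise back off to its lowest-IDF (n-1)-gram, so a
        phrase like "you've never tried" is capped by its own bigrams."""
        if len(gram) == 1 or df[gram] >= min_df:
            return idf(gram, df, num_docs)
        subgrams = [gram[:-1], gram[1:]]
        return min(backoff_idf(g, df, num_docs, min_df) for g in subgrams)

The recursion keeps backing off until it reaches a gram that is common enough to trust, so a phrase only keeps a high score if every shorter gram inside it is also rare.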