I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3- and 4-grams in my data that have super-high IDF actually consist of component unigrams and bigrams that have really low IDF.
For example, "you've never tried" has a very high IDF, while each of the component unigrams has very low IDF.
I need to come up with a function that can take in document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest.
If I were dealing with probabilities, I would try interpolation or backoff models. But I am not sure what assumptions/intuitions those models rely on to perform well, and therefore how well they would work for IDF scores.
Does anybody have any better ideas?
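To make the problem concrete, here is one possible backoff-flavoured heuristic, sketched in Python. This is an assumption on my part, not an established algorithm: an n-gram's score is its own IDF, but capped by the best score among its two component (n-1)-grams, so a phrase built entirely from very common parts (like "you've never tried") cannot score arbitrarily high just because it happens to occur in only one document.

```python
import math
from collections import defaultdict

def document_frequencies(docs, max_n=4):
    """docs: list of token lists. Count, for every n-gram up to
    max_n tokens long, the number of documents containing it."""
    df = defaultdict(int)
    for tokens in docs:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for gram in seen:
            df[gram] += 1
    return df

def idf(gram, df, n_docs):
    return math.log(n_docs / df[gram]) if df[gram] else 0.0

def phrase_score(gram, df, n_docs, alpha=1.0):
    """Hypothetical backoff-style score: the n-gram's own IDF, capped
    by the best score of its two (n-1)-gram components, scaled by
    1/alpha (alpha < 1 loosens the cap). A rare phrase made of very
    common words then inherits a low score from its components."""
    own = idf(gram, df, n_docs)
    if len(gram) == 1:
        return own
    part_best = max(phrase_score(gram[:-1], df, n_docs, alpha),
                    phrase_score(gram[1:], df, n_docs, alpha))
    return min(own, part_best / alpha)
```

With this, a trigram that occurs in a single document still gets a low score if every shorter phrase inside it is common across the corpus. I have no quantitative evidence that this is the right combination function, which is exactly what I'm asking about.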

+1  A: 

I take it that "you've never tried" is a phrase that you don't want to extract, but which has high IDF. The problem will be that there are going to be a vast number of n-grams that only occur in one document and so have the largest possible IDF score.

There are lots of smoothing techniques in NLP. The paper by Chen & Goodman is a pretty good summary of many of them. In particular, it sounds like you might be interested in the Kneser-Ney smoothing algorithm, which works in the way you suggest (backing off to lower-order n-grams).

These methods are usually used for the task of language modelling, i.e. to estimate the probability of an n-gram occurring given a really big corpus of the language. I don't really know how you might integrate them with IDF scores, or even if that's really what you want to do.
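To show the backoff idea concretely, here is a minimal interpolated Kneser-Ney model for bigrams (a sketch, not production code; the discount d=0.75 is the conventional default, not tuned). Probability mass removed from observed bigrams is redistributed via a continuation probability, which asks in how many distinct contexts a word appears rather than how often it occurs:

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams over a token list.
    Returns a function p(w, v) estimating P(w | v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])                 # c(v)
    followers = Counter(v for (v, _) in bigrams)    # N1+(v, .)
    preceders = Counter(w for (_, w) in bigrams)    # N1+(., w)
    n_bigram_types = len(bigrams)

    def p(w, v):
        # Continuation probability: fraction of bigram types ending in w.
        cont = preceders[w] / n_bigram_types
        if contexts[v] == 0:
            return cont                             # unseen context: back off fully
        discounted = max(bigrams[(v, w)] - d, 0) / contexts[v]
        lam = d * followers[v] / contexts[v]        # mass reserved for backoff
        return discounted + lam * cont
    return p
```

For each seen context v, the discounted bigram estimates plus the backed-off continuation mass sum to 1, which is the property the elaborate-looking equations are engineering.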

StompChicken
Most of these smoothing models seem to work by moving probability mass around from term to term in elaborate ways, so that the model can recognize a language better. I haven't studied Kneser-Ney in detail yet, but the equations look pretty complicated. Studying smoothing models did not appeal to me initially because I felt they were concerned with redistributing probability mass (so that everything is > 0 and sums to 1.0), and an IDF score is nothing like a probability value.
adi92
But I guess I could somehow adapt some of them to do something meaningful with IDFs as well. I really don't want to do anything too elaborate or complicated. I was looking for a simple solution with a strong justification, either an intuitive explanation or an academic paper with quantitative evidence.
adi92