I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3- and 4-grams in my data have very high IDF even though their component unigrams and bigrams have very low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
I need to come up with a function that takes the document frequencies of an n-gram and of all its component (n-k)-grams, and returns a more meaningful measure of how much the phrase distinguishes its parent document from the rest.
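For concreteness, this is roughly the setup I have at the moment (just a minimal sketch; docs as a list of already-tokenised documents and the 4-gram cap are stand-ins for my actual pipeline):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Yield every n-gram of a token list as a tuple."""
        return zip(*(tokens[i:] for i in range(n)))

    def document_frequencies(docs, max_n=4):
        """Count, for every 1..max_n-gram, how many documents contain it."""
        df = Counter()
        for tokens in docs:
            seen = set()
            for n in range(1, max_n + 1):
                seen.update(ngrams(tokens, n))
            df.update(seen)  # each gram counted at most once per document
        return df

    def idf(gram, df, num_docs):
        """Plain IDF; grams with df close to 1 get the highest scores."""
        return math.log(num_docs / df[gram]) if df[gram] else 0.0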
If I were dealing with probabilities, I would try interpolation or backoff models. But I am not sure what assumptions or intuitions those models rely on to perform well, so I don't know how well they would carry over to IDF scores.
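The most naive thing I can think of is a plain linear interpolation between the n-gram's own IDF and the mean IDF of its component unigrams, reusing the df/idf helpers from the sketch above; but the weight lam below is completely arbitrary, and that is exactly the part I don't know how to choose in a principled way:

    def interpolated_idf(gram, df, num_docs, lam=0.5):
        """Blend the n-gram's own IDF with the mean IDF of its component
        unigrams; lam is an arbitrary, unjustified interpolation weight."""
        own = idf(gram, df, num_docs)
        unigrams = [idf((w,), df, num_docs) for w in gram]
        return lam * own + (1 - lam) * sum(unigrams) / len(unigrams)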
Does anybody have any better ideas?
Answer:
I take it that "you've never tried" is a phrase that you don't want to extract, but which has a high IDF. The problem is that there will be a vast number of n-grams that occur in only one document and therefore get the largest possible IDF score.
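With the usual definition idf(t) = log(N / df(t)) over N documents, any n-gram with df(t) = 1 gets the maximum possible score log(N), however uninteresting the phrase is, and long n-grams are exactly the ones most likely to occur in only a single document.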
There are lots of smoothing techniques in NLP. This paper [Chen & Goodman] is a pretty good summary of many of them. In particular, you sound like you might be interested in Kneser-Ney smoothing, which works in the way you suggest (backing off to lower-order n-grams).
These methods are usually used for language modelling, i.e. to estimate the probability of an n-gram occurring given a very large corpus of the language. I don't really know how you might integrate them with IDF scores, or even whether that is really what you want to do.
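That said, if you want to stay with plain IDF rather than a full language model, one crude backoff-flavoured heuristic (this is not Kneser-Ney, just an illustration of the shape of the idea, with an arbitrary min_df threshold and reusing the df/idf helpers from your question) would be to trust a long n-gram's own IDF only when it occurs in enough documents, and otherwise fall back to its least surprising (n-1)-gram:

    def backoff_idf(gram, df, num_docs, min_df=5):
        """Trust the n-gram's own IDF only if it appears in at least min_df
        documents; otherwise back off to its lowest-IDF (n-1)-gram, so a
        phrase like "you've never tried" is capped by its own bigrams."""
        if len(gram) == 1 or df[gram] >= min_df:
            return idf(gram, df, num_docs)
        subgrams = [gram[:-1], gram[1:]]
        return min(backoff_idf(g, df, num_docs, min_df) for g in subgrams)

The recursion keeps backing off until it reaches a gram that is common enough to trust, so a phrase only keeps a high score if every shorter gram inside it is also rare.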