Hi!

I am storing n-grams up to length 3 in a Lucene index. When I read the index and calculate scores for the terms and n-grams, I get results like this:

TERM               FREQUENCY    TFIDF
minority           25           16.512926
minority report    24           16.179296
report             27           13.559037
cruise             12           11.440491
tom cruise         7            8.737819

Take "tom cruise" as an example: as a bigram it occurs 7 times, while "cruise" occurs 12 times in total, so "cruise" occurs on its own only 5 times. I don't want this double counting of frequency: "cruise" alone scores higher than "tom cruise", which is misleading, because most of its occurrences are inside the bigram.
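The subtraction described above (12 - 7 = 5 for "cruise") can be sketched in a few lines. This is only a rough heuristic, not something Lucene provides: the function name `standalone_counts` and the dictionary layout are assumptions made for illustration. It subtracts from each term the frequency of every n-gram that is exactly one token longer and contains it, so it over-subtracts when a term sits in the middle of a longer phrase and is therefore covered by two overlapping n-grams.

```python
def standalone_counts(freq):
    """Estimate how often each term occurs outside a longer n-gram.

    freq maps a space-separated term to its raw frequency.
    Hypothetical helper: a heuristic, not part of Lucene.
    """
    split = {term: tuple(term.split()) for term in freq}
    result = {}
    for term, tokens in split.items():
        n = len(tokens)
        inside = 0
        for other, other_tokens in split.items():
            # Only look at n-grams exactly one token longer than this term.
            if len(other_tokens) != n + 1:
                continue
            # Does the longer n-gram contain this term as a contiguous run?
            if any(other_tokens[i:i + n] == tokens
                   for i in range(len(other_tokens) - n + 1)):
                inside += freq[other]
        # Clamp at zero because overlapping n-grams can over-subtract.
        result[term] = max(freq[term] - inside, 0)
    return result


# Frequencies taken from the table in the question.
freq = {
    "minority": 25,
    "minority report": 24,
    "report": 27,
    "cruise": 12,
    "tom cruise": 7,
}
adjusted = standalone_counts(freq)
```

With the adjusted counts, "cruise" drops to 5 while "tom cruise" keeps its 7, so a score recomputed from them would no longer favour the unigram.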

Sorry if I explain this badly; I don't know what this type of scoring is called. If someone knows the proper technical terms, please edit.

Thank you

+1  A: 

I believe I answered a similar question you asked a while ago. IIUC, you want the more important terms to stand out, and you feel that "tom cruise" is more important than "cruise".

This looks like a problem in your model of the data. TFIDF seems to be wrong for what you want. You can try building a language model, as described in Peter Norvig's "Beautiful Data" chapter.

The gist is:

  • Calculate a probability for each unigram, bigram and trigram (you will need smoothing or back-off, as explained in the chapter).
  • Choose your terms by probability rather than TFIDF.
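The two steps above can be sketched roughly as follows. This is a minimal illustration, not the method from Norvig's chapter: it uses simple add-k smoothing within each n-gram order, and the names `build_model` and `probability` and the parameter `k` are assumptions made for the example. One extra vocabulary slot is reserved so that unseen n-grams get a small non-zero probability.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(tokens, max_n=3):
    """Count n-grams of each order from 1 to max_n (hypothetical helper)."""
    model = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        vocab = len(counts) + 1  # one extra slot for unseen n-grams
        model[n] = (counts, total, vocab)
    return model


def probability(model, phrase, k=1.0):
    """Add-k smoothed probability of a phrase among n-grams of its own order."""
    gram = tuple(phrase.split())
    counts, total, vocab = model[len(gram)]
    # Counter returns 0 for n-grams that never occurred.
    return (counts[gram] + k) / (total + k * vocab)


tokens = "tom cruise stars in minority report and tom cruise runs".split()
model = build_model(tokens)
candidates = ["tom cruise", "cruise runs", "report cruise"]
ranked = sorted(candidates, key=lambda p: probability(model, p), reverse=True)
```

Probabilities here are only comparable within the same n-gram order; comparing a unigram against a bigram needs a proper back-off or interpolation scheme, as the chapter discusses.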

A Language Model Approach to Keyphrase Extraction seems to do similar stuff. Some alternatives are Kea (which uses TFIDF as one feature among several) and Peter Turney's Keyphrase extraction work.

Yuval F
@Yuval F: Thanks a lot for the tips. I tried Kea, but it seems geared more towards domain-specific controlled vocabularies. On the Kea page I read about Maui, which does the same thing with some additional features: http://code.google.com/p/maui-indexer/ The results I am getting are very good! Now I will dig into the details of the algorithm and the scoring calculations. Thanks!
Julia