Hi!
I am storing n-grams of up to 3 words in a Lucene index. When I read the index and compute scores for the terms and n-grams, I get results like this:
TERM              FREQUENCY   TFIDF
minority          25          16.512926
minority report   24          16.179296
report            27          13.559037
cruise            12          11.440491
tom cruise        7           8.737819
Take "tom cruise" as an example: as a bigram it occurs 7 times, while "cruise" occurs 12 times in total, so "cruise" appears on its own only 5 times. I don't want this double counting of frequency. As it stands, "cruise" alone scores higher than "tom cruise", which is misleading, because most occurrences of "cruise" are inside the bigram.
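To make the adjustment I have in mind concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the frequencies have already been read from the index into a dictionary; the function name `standalone_frequencies` is only made up for illustration. It walks the n-grams from longest to shortest and subtracts each one's count from the shorter n-grams it contains. The TF-IDF would then have to be recomputed from the adjusted frequencies, which is not shown here.

```python
from collections import Counter


def standalone_frequencies(counts):
    """Return frequencies with occurrences inside longer n-grams removed.

    `counts` maps a space-separated n-gram to its raw frequency.
    Assumption: every longer n-gram that matters is present in `counts`.
    """
    adjusted = dict(counts)
    # Longest n-grams first, so a count is final before it is subtracted
    # from the shorter n-grams it contains.
    for gram in sorted(counts, key=lambda g: len(g.split()), reverse=True):
        tokens = gram.split()
        n = len(tokens)
        # How often each shorter contiguous n-gram occurs inside this one.
        inner = Counter(
            " ".join(tokens[i:i + size])
            for size in range(1, n)
            for i in range(n - size + 1)
        )
        for sub, times in inner.items():
            if sub in adjusted:
                # Clamp at zero in case the raw counts are inconsistent.
                adjusted[sub] = max(0, adjusted[sub] - times * adjusted[gram])
    return adjusted


counts = {
    "minority": 25,
    "minority report": 24,
    "report": 27,
    "cruise": 12,
    "tom cruise": 7,
}
result = standalone_frequencies(counts)
for term, freq in result.items():
    print(term, freq)
```

With the numbers above, "cruise" drops to 5, "minority" to 1 and "report" to 3, while the bigrams keep their counts. Since my index only goes up to 3-grams, overlaps that would only be visible in longer n-grams are not corrected, so this is an approximation.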
Sorry if my explanation is unclear. I don't know what this kind of scoring is called; if someone knows the correct technical terms, please edit.
Thank you