ansaurus

Question

Frequencies of Lucene unigrams and bigrams

Answer 1

+1 A:

I believe I answered a similar question you asked a while ago. IIUC, you want the more important terms to stand out, and you feel that "tom cruise" is more important than "cruise".

This looks like a problem in your model of the data. TFIDF seems to be the wrong measure for what you want. You can try building a language model, as described in Peter Norvig's chapter in "Beautiful Data".

The gist is:

Calculate a probability for each unigram, bigram and trigram (you will need smoothing or back-off, as explained in the chapter).
Choose your terms by probability rather than by TFIDF.
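
As a rough illustration of the first step, here is a minimal sketch that counts unigrams, bigrams and trigrams and scores them with a simple back-off. It is not the code from Norvig's chapter: it uses a "stupid back-off" style scheme with add-one smoothing for unigrams, so the values are scores rather than true probabilities. The function names, the toy corpus and the discount `alpha=0.4` are all assumptions made for the example.

```python
from collections import Counter

def count_ngrams(tokens, max_n=3):
    """Count every n-gram (stored as a tuple) for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(term, counts, total_tokens, vocab_size, alpha=0.4):
    """Score the last word of `term` given the words before it."""
    if len(term) == 1:
        # Add-one smoothing so unseen words still get a non-zero score.
        return (counts[term] + 1) / (total_tokens + vocab_size)
    if counts[term] > 0:
        # Seen n-gram: relative frequency given its context.
        return counts[term] / counts[term[:-1]]
    # Unseen n-gram: drop the first word and discount by alpha (assumed value).
    return alpha * backoff_score(term[1:], counts, total_tokens, vocab_size, alpha)

# Hypothetical toy corpus, already tokenized and lowercased.
tokens = "tom cruise stars in a film and tom cruise smiles".split()
counts = count_ngrams(tokens)
total_tokens = len(tokens)
vocab_size = len(set(tokens))

seen = backoff_score(("tom", "cruise"), counts, total_tokens, vocab_size)
unseen = backoff_score(("cruise", "tom"), counts, total_tokens, vocab_size)
print(seen, unseen)
```

In this toy corpus "cruise" always follows "tom", so the seen bigram scores much higher than the unseen one, which falls back to the discounted unigram score. On a real index you would take the counts from your term statistics instead of a token list.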

The paper "A Language Model Approach to Keyphrase Extraction" seems to do something similar. Some alternatives are Kea (which uses TFIDF as one feature among several) and Peter Turney's keyphrase extraction work.

Yuval F 2010-08-27 20:13:13

@Yuval F: Thanks a lot for the tips. I tried Kea, but it seems to be geared more toward domain-specific controlled vocabularies. On the Kea page I read about Maui, which does the same thing with some additional features: http://code.google.com/p/maui-indexer/ The results I am getting are very good! Now I will dig in to see the exact details of the algorithm and the scoring calculations. Thanks!

Julia 2010-08-28 12:19:28
