Hi

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or some other distance) between two documents in the index.

For example, I am getting the documents with ids 2 and 4 from a previously opened IndexReader ir: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

+3  A: 

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.
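As a rough sketch of that calculation (not a definitive implementation): assume the term frequencies of each document have already been copied into a map from the arrays returned by TermFreqVector.getTerms() and getTermFrequencies(), and the document frequencies into another map from IndexReader.docFreq(). The class name CosineSketch, the methods weigh and cosine, and the idf formula 1 + ln(numDocs / (docFreq + 1)) are choices made for this illustration; other weightings work just as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class CosineSketch {

    // Turns raw term frequencies into tf-idf weights.
    // Assumed idf: 1 + ln(numDocs / (docFreq + 1)).
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int docFreq = docFreqs.getOrDefault(e.getKey(), 0);
            double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity of two sparse vectors: dot(a, b) / (|a| * |b|).
    // Returns 0 if either vector is empty or all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With both documents weighted by weigh(), cosine(a, b) gives a value between 0 (no shared terms) and 1 (same direction) for non-negative weights.
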

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
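A minimal sketch of building such a query, assuming doc A's term frequencies are already in a map and that the terms contain no characters that are special in Lucene's query syntax (otherwise they would need escaping first). The class BoostedQuerySketch and its method are hypothetical names for this example; the output uses the classic query parser syntax, where term^n boosts a term.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class BoostedQuerySketch {

    // Builds "termA^tfA OR termB^tfB ..." from a term-frequency map.
    // Terms are sorted so the output is deterministic.
    static String toQueryString(Map<String, Integer> termFreqs) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFreqs).entrySet()) {
            query.add(e.getKey() + "^" + e.getValue());
        }
        return query.toString();
    }
}
```

The resulting string could then be parsed and searched as usual. Each hit in the returned TopDocs carries a ScoreDoc.score, but that value comes from Lucene's own scoring formula, so it should not be read as an exact cosine between the two document vectors.
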

bajafresh4life
Yes, OK for the first part: I already use the term frequency vectors to get what I want, but I wanted to check how much faster it would be to get the similarity directly from Lucene. For the second part of your answer, I checked the Javadoc and there is no obvious way to get a similarity score. I can look for doc B in the result set, but all I can get is its position in the TopDocs, not the exact similarity score between the two document vectors, which is what I want.
maiky