I need to get the Vector Space Model (with tf-idf weighting) from the results of a Lucene query, and I can't figure out how to do it. It seems like it should be simple, and at this stage maybe one of you can point me in the right direction.

I have been trying to figure this out for a good while, and either I haven't yet understood how what I have read applies to my problem (more than likely), or a solution to my particular problem hasn't been posted. I even tried computing the VSM myself directly from the query results, but my solution has hideous complexity.

Edit: For anyone else who stumbles upon this, there is a solution at the much clearer question here. What I need can be obtained with the IndexReader.getTermFreqVector(int docNumber, String field) method.

Unfortunately this doesn't work for me as the index I am working off hasn't stored the term frequency vectors, so I guess I'm still looking for more help on this!

+1  A: 

Maybe I'm misunderstanding what you're trying to do, but Lucene's scoring uses the vector space model. If you want more details on how the scores are calculated for a given document and query, use Searcher.explain(Query query, int doc).

bajafresh4life
I need to be able to compute the similarity of all of the results with each other, using their term vectors. As far as I can tell, the Lucene scoring only tells you the similarity score between your query and a document.
Mark A
Submit the text of each document as the query, and you'll get the cosine similarity for that document with every other document in your index. When you transform the text of the document into a query, make sure each term is an OR term.
bajafresh4life
A: 

If I understand correctly from your comment, you want to compute VSM cosine similarity between documents rather than between a query and a document. I don't know exactly how to do this, but I'd point you to the Lucene API page for the Similarity class. You'd probably have to derive and use a custom subclass of Similarity that overrides the coord and queryNorm methods, and find a way to turn documents into query objects.

(No guarantees; I'm just trying to figure out this scoring myself.)

larsmans
Yep, that's what I'm looking for. I'll have a fresh look at the Similarity class. Thanks for your help.
Mark A
+1  A: 

To answer this question, you can compute a TF-IDF weighted vector space model for a set of Lucene results using the IndexReader.getTermFreqVector() and Searcher.docFreq() methods. There is no way of directly getting the VSM for a set of results in Lucene.

(this is all as far as I can tell)
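
For what it's worth, the computation described above can be sketched roughly as follows. This is plain Java with no Lucene calls: the class name TfIdfSketch and its methods are made up for illustration, and it assumes the per-document term counts (what getTermFreqVector() would give you) and the document frequencies (what docFreq() would give you) have already been copied into ordinary maps. The idf used here, ln(numDocs / df), is one common textbook variant and not necessarily the formula Lucene's own scoring uses.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene. Inputs are plain maps that would
// be filled from the index (term counts per document, document frequencies).
final class TfIdfSketch {

    // Builds a sparse tf-idf vector: weight = tf * ln(numDocs / df).
    // This idf variant is an assumption, not Lucene's exact formula.
    static Map<String, Double> tfIdfVector(Map<String, Integer> termFreqs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // no document frequency known for this term: skip it
            }
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; 0.0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Symmetric matrix of similarities between every pair of result vectors.
    static double[][] pairwise(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

You would build one vector per hit with tfIdfVector() and pass the list to pairwise(). Since only the upper triangle is computed, that is n(n+1)/2 cosine computations for n results, each touching only the terms that actually occur in the documents.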

Mark A