I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

So these 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

The frequency of each term across the whole index:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

for easy reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know now is: how do I obtain the term frequency vector for a set of documents?

for example:

Set<Documents> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs); 

System.out.println( termFrequencies );

would result in the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

With all zeros removed:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Notice that the result vector contains only the term frequencies for the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but only 2 times in the source set of documents.

A naive implementation would be to iterate over all documents in the docs set, create a map and count each term. But I need a solution that also works with a document set size of 100,000 or 500,000.
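For reference, the naive approach described above could look roughly like the sketch below. It is not Lucene code: each document is modelled as a plain list of terms, and the class name `NaiveTermCounter` and method `countTerms` are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveTermCounter {

    // Counts how often each term occurs across the given documents only.
    // Each document is modelled as a plain list of terms (an assumption;
    // with Lucene the terms would have to be read from the index instead).
    static Map<String, Integer> countTerms(List<List<String>> docs) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                Integer old = counts.get(term);
                counts.put(term, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        System.out.println(countTerms(Arrays.asList(doc2, doc3)));
    }
}
```

The cost grows with the total number of terms in the selected documents, which is exactly the concern for sets of several hundred thousand documents.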

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that could be built at index time to obtain such a term vector easily and quickly?

I'm not a Lucene expert, so I'm sorry if the solution is obvious or trivial.

Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.

A: 

I don't know Lucene; however, your naive implementation will scale, provided you don't read the entire document into memory at one time (i.e. use an on-line parser). English text is about 83% redundant, so your biggest document will have a map with 85000 entries in it. Use one map per thread (and one thread per file, pooled obviously) and you will scale just fine.
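A minimal sketch of that idea, assuming documents are already available as lists of terms: each pooled task counts one document into its own private map, and the partial maps are merged on the calling thread, so no map is ever shared between threads. The names `PooledTermCounter` and `countTerms` are hypothetical, and one task per document stands in for "one thread per file".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PooledTermCounter {

    static Map<String, Integer> countTerms(List<List<String>> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Integer>>> partials =
                    new ArrayList<Future<Map<String, Integer>>>();
            for (final List<String> doc : docs) {
                // One task per document, each with its own private map.
                partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                    public Map<String, Integer> call() {
                        Map<String, Integer> local = new HashMap<String, Integer>();
                        for (String term : doc) {
                            Integer old = local.get(term);
                            local.put(term, old == null ? 1 : old + 1);
                        }
                        return local;
                    }
                }));
            }
            // Merge the partial maps on the calling thread.
            Map<String, Integer> total = new HashMap<String, Integer>();
            for (Future<Map<String, Integer>> partial : partials) {
                for (Map.Entry<String, Integer> e : partial.get().entrySet()) {
                    Integer old = total.get(e.getKey());
                    total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
                }
            }
            return total;
        } catch (Exception e) {
            // Interrupted or failed task: surface it as an unchecked exception.
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For very small documents the per-task overhead would probably dominate, so batching several documents per task may be the better trade-off.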

Update: If your term list does not change frequently, you might try building a search tree out of the characters in your term list, or building a perfect hash function (http://www.gnu.org/software/gperf/) to speed up file parsing (mapping from search terms to target strings). A big HashMap would probably perform about as well.

Justin
A: 

See if this helps

Mikos
Thanks, but with this solution I get the overall frequency and not just the frequency for a subset of documents.
ManBugra
Perhaps you should consider creating a temporary index for the subset of documents. This might be a hacky approach, but you would get all the power that Lucene provides.
Mikos
+1  A: 

Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).

I believe there is a similar method for Lucene 2.x.x.
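A rough sketch of how the per-document vectors could be summed for a subset of doc ids. To keep it runnable without Lucene on the classpath, the term vector and its lookup are replaced by hypothetical stand-ins (`TermVector`, `TermVectorSource`); in Lucene 3.x the lookup would presumably be backed by `IndexReader.getTermFreqVector(docId, field)`, whose result exposes parallel arrays through `getTerms()` and `getTermFrequencies()`. This only works if term vectors were stored at index time.

```java
import java.util.HashMap;
import java.util.Map;

class SubsetTermFrequencies {

    // Hypothetical stand-in for one stored term vector: parallel arrays of
    // terms and their counts within a single document.
    static class TermVector {
        final String[] terms;
        final int[] freqs;

        TermVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Hypothetical lookup by internal doc id; with Lucene 3.x this would
    // wrap the IndexReader call mentioned above.
    interface TermVectorSource {
        TermVector vectorFor(int docId);
    }

    // Sums the per-document frequencies over the given doc ids only.
    static Map<String, Integer> sum(TermVectorSource source, int[] docIds) {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (int docId : docIds) {
            TermVector tv = source.vectorFor(docId);
            if (tv == null) {
                continue; // no term vector stored for this document
            }
            for (int i = 0; i < tv.terms.length; i++) {
                Integer old = total.get(tv.terms[i]);
                total.put(tv.terms[i], old == null ? tv.freqs[i] : old + tv.freqs[i]);
            }
        }
        return total;
    }
}
```

This still touches every document in the subset once, so it is the same amount of work as the naive loop; the gain is only that the terms come pre-counted per document.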

Toader Mihai Claudiu
A: 

See this similar SO question which has a link to some example code.

mindas