I'm in the process of updating a tool that uses a Lucene index. As part of this update we are moving from Lucene 2.0.0 to 3.0.2. For the most part this has been entirely straightforward. However, in one instance I can't seem to find a straightforward conversion.

Basically I have a simple query and I need to iterate over all hits. In Lucene 2 this was simple, e.g.:

Hits hits = indexSearcher.search(query);
for(int i=0 ; i<hits.length() ; i++){
  // Process hit
}

In Lucene 3 the API for IndexSearcher has changed significantly, and although I can bash together something that works, it only works by getting the top X documents and making sure that X is sufficiently large.

While the number of hits (in my case) is typically between zero and ten, there are anomalous situations where it could be much higher. Having a fixed limit therefore feels wrong. Furthermore, setting the limit really high causes an OOME, which suggests that space for all X possible hits is allocated up front. As this operation is carried out a lot, something reasonably efficient is desired.

Edit:

Currently I've got the following to work:

TopDocs hits = indexSearcher.search(query, MAX_HITS);
// Only hits.scoreDocs.length results (at most MAX_HITS) are actually returned
for (int i = 0; i < hits.scoreDocs.length; i++) {
   // Process hit, e.g. via hits.scoreDocs[i].doc
}

This works fine except that

a) what if there are more hits than MAX_HITS?

and

b) if MAX_HITS is large then I'm wasting memory as room for each hit is allocated before the search is performed.

As most of the time there will only be a few hits, I don't mind doing follow-up searches to get the subsequent hits, but I can't seem to find a way to do that.
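
To be concrete, the kind of follow-up search I have in mind would look roughly like the sketch below (firstPass is just a name I'm using here; query, indexSearcher and MAX_HITS are the same as above). I simply haven't been able to confirm that re-running the search sized by totalHits is the intended pattern:

TopDocs firstPass = indexSearcher.search(query, MAX_HITS);
TopDocs hits = firstPass;
if (firstPass.totalHits > firstPass.scoreDocs.length) {
    // More matches exist than were returned, so repeat the search sized to the full count
    hits = indexSearcher.search(query, firstPass.totalHits);
}
for (int i = 0; i < hits.scoreDocs.length; i++) {
    // Process hit
}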

A: 

Why don't you use Searcher.search(Query query, int n)? You can specify the number of results you want back, and you can use the TopDocs object that is returned to iterate through the results.

Using Hits to process long result sets was a bad idea, because behind the scenes the Hits object would run additional searches to fill in results it didn't already have.

TopDocs only contains ids and scores, so you shouldn't have a memory problem even for large n.
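
As a rough sketch (n here is whatever limit you choose; query and indexSearcher are the ones from your snippets):

TopDocs topDocs = indexSearcher.search(query, n);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = indexSearcher.doc(scoreDoc.doc);  // fetch the stored fields for this hit if you need them
    // Process hit
}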

bajafresh4life
That is basically what I'm currently doing. But what if I need result number n+1?
Kris
Just ask for N + M where M is some kind of constant value. I think you're worrying too much about memory here; TopDocs only contains scores and ids, which is almost no memory at all, even for large N. If you don't believe me, run a profiler to find out.
bajafresh4life
A: 

How about using numDocs() from the IndexReader as the maximum number of results?

Do watch out for the edge case of zero documents in the index though...
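
Roughly like this, as a sketch (maxResults is just a local name; query and indexSearcher are the ones from your snippets):

int maxResults = Math.max(1, indexSearcher.getIndexReader().numDocs());  // numDocs() is 0 for an empty index
TopDocs hits = indexSearcher.search(query, maxResults);
for (int i = 0; i < hits.scoreDocs.length; i++) {
    // Process hit
}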

Hope this helps,

Moleski
A: 

IndexSearcher has a docFreq(Term) method. Invoking it does not seem to carry a performance penalty, and its result is a suitable value for the number of documents to request.

E.g.

int freq = indexSearcher.docFreq(new Term(FIELD, value));
TopDocs hits = indexSearcher.search(query, freq);
for (int i = 0; i < hits.scoreDocs.length; i++) {
   // Process hit
}

This works because my query is essentially a TermQuery. If it were a more complex query, this wouldn't be suitable.

Kris