I'm searching articles in PubMed via Lucene. Each of the 20,000,000 articles has an abstract with ~250 words and an ID.

At the moment I store the results of my searches, which each take multiple seconds, in a TopDocs object. Searches can find thousands of articles. I'm only interested in the IDs of the articles. Does Lucene load the abstracts internally into the TopDocs?

If so, can I prevent that behavior through FieldSelectors, or do FieldSelectors only work with IndexReader and not with IndexSearcher?

+2  A: 

No, Lucene does not load the values of fields into TopDocs. TopDocs only contains the doc number and score for each one of the matching documents.

If you're having performance issues, here's another SO question that can help you:

http://stackoverflow.com/questions/668441/optimizing-lucene-performance

bajafresh4life
+1  A: 

Lucene, by default, does not load any stored fields. If you want to retrieve only the ID field, and if you can afford to load up all the IDs in memory, then you can load all values as follows and reuse them.

String[] allIDs = FieldCache.DEFAULT.getStrings(indexReader, "IDFieldName");
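The array returned by getStrings is indexed by Lucene's internal doc number, so turning search hits into IDs is a plain array lookup with no stored-field access. Here is a minimal sketch of that lookup in plain Java, with no Lucene dependency. The class and method names (IdLookup, idsForHits) are hypothetical. The allIds parameter stands in for the FieldCache array, and docNumbers stands in for the ScoreDoc.doc values of a result. It assumes the ID field is indexed as a single untokenized term per document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class IdLookup {

    // allIds: stand-in for the array from FieldCache.DEFAULT.getStrings(...),
    //         where position i holds the ID of the document with doc number i.
    // docNumbers: stand-in for the ScoreDoc.doc values of the matching documents.
    static List<String> idsForHits(String[] allIds, int[] docNumbers) {
        List<String> ids = new ArrayList<String>(docNumbers.length);
        for (int doc : docNumbers) {
            String id = allIds[doc];
            // Documents without a value in the ID field have a null entry; skip them.
            if (id != null) {
                ids.add(id);
            }
        }
        return ids;
    }
}
```

In a real search, docNumbers would be filled from topDocs.scoreDocs, and the array should be reloaded whenever the IndexReader is reopened, since doc numbers can change between readers.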

Please check the answer for FieldCache. http://stackoverflow.com/questions/2511879/best-way-to-retrieve-certain-field-of-all-documents-returned-by-a-lucen-search/2513252#2513252

Shashikant Kore
+1  A: 

You're on the right lines.

Try using a SetBasedFieldSelector when you retrieve the document from the index.

As another poster noted, iterating through the hits gives you ScoreDoc objects. Each one carries the document number (its doc field), which you can use to retrieve the document through the IndexReader associated with the IndexSearcher.

If IO is a problem because Lucene is loading fields you aren't interested in, you should be in for a pleasant surprise: with the field selector in place, only the fields you name are loaded.

Hope this helps,

Moleski