I'm using Lucene.net (2.9.2.2) on a (currently) 70 GB index. I can run a fairly complicated search and get all the document IDs back in 1–2 seconds, but actually loading all the hits (about 700 thousand in my test queries) takes 5+ minutes.

We aren't using Lucene for a UI; it's a datastore between processes, holding hundreds of millions of pre-cached data elements, and the part I am working on exports a few specific fields from each found document. (Ergo, pagination doesn't make sense, as this is an export between processes.)

My question is: what is the best way to load all of the documents in a search result? Currently I am using a custom collector that gets each document (with a MapFieldSelector) as it collects, roughly as sketched below. I've also tried iterating through the hit list after the collector has finished, but that was even worse.
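For reference, a minimal sketch of that kind of collector against the Lucene.NET 2.9 API (the field names are placeholders, and exact member signatures may vary slightly between point releases):

    using System.Collections.Generic;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Sketch of a collect-and-load collector for Lucene.NET 2.9.x.
    // "id" and "payload" are placeholder field names, not the real export fields.
    public class ExportCollector : Collector
    {
        private readonly FieldSelector selector =
            new MapFieldSelector(new[] { "id", "payload" });
        private readonly List<string[]> rows = new List<string[]>();
        private IndexReader currentReader;

        public IList<string[]> Rows { get { return rows; } }

        public override void SetScorer(Scorer scorer)
        {
            // Scores aren't needed for an export, so the scorer is ignored.
        }

        public override void SetNextReader(IndexReader reader, int docBase)
        {
            // Doc ids passed to Collect are relative to this per-segment reader.
            currentReader = reader;
        }

        public override void Collect(int doc)
        {
            // Load only the selected stored fields for this hit.
            Document d = currentReader.Document(doc, selector);
            rows.Add(new[] { d.Get("id"), d.Get("payload") });
        }

        public override bool AcceptsDocsOutOfOrder()
        {
            return true; // collection order doesn't matter for a bulk export
        }
    }

It gets driven with something along the lines of `searcher.Search(query, exportCollector)`.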

I'm open to ideas :-).

Thanks in advance.

A: 

Hmmm, given that things got worse when your "get" code was moved outside the collector, it sounds like your problem is I/O-related.

I'm almost dreading asking this given the size of your index, but have you tried:

  • Optimising the index
  • De-fragmenting your hard disk

If so, was there a noticeable effect on the rate at which documents are retrieved? BTW, I get about 2,333 items/second retrieved (700,000 documents in 5 minutes), if my shaky maths is correct...

Also, for the subset of fields you're retrieving, are any of them amenable to compression? Or have you already experimented with compression?

As a related matter, what proportion of your index do 700 thousand items represent? It'd be interesting to get a feel for the I/O throughput. You could probably work out the maximum theoretical data rate for your machine/hard-drive combination and see whether you're already close to the limit.

Moleski
Ran an optimize last night. Kicked off a defrag this morning at 5:00 am (it's still running). I'll let you know if that sped things up at all. :-)
Josh Handel
Unfortunately there was no appreciable difference. Doing the calculations, I'm sitting at about 0.51 ms per document (averaged over 780,000 documents read), which isn't "bad". I'm trying the field cache now, but it has (thus far) taken 50 minutes to load (not done yet) and 5 1/2 GB of RAM.
Josh Handel
I don't think the field cache will help, unless you are retrieving the same documents again.
Flynn81
+1  A: 

What fields do you need to search? What fields do you need to store? Lucene.net is probably not the most efficient way to store and retrieve the actual document texts. Your scenario suggests not storing anything, indexing only the needed fields, and returning a list of document IDs. The documents themselves can be stored in an auxiliary database.
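For illustration, a minimal sketch of that split using the Lucene.NET 2.9 Field API (the field names and the external-store key are hypothetical):

    using Lucene.Net.Documents;

    // Hypothetical document layout: the search field is indexed but not stored,
    // and only a small key is stored so the full record can be fetched from an
    // auxiliary database after the search.
    public static class ExportDoc
    {
        public static Document Build(string externalKey, string searchableText)
        {
            Document doc = new Document();

            // Stored, not analyzed: the key used to look the record up elsewhere.
            doc.Add(new Field("key", externalKey,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));

            // Indexed for searching, but not stored in the Lucene index at all.
            doc.Add(new Field("body", searchableText,
                              Field.Store.NO, Field.Index.ANALYZED));

            return doc;
        }
    }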

Yuval F
The size of the data being stored is very small (a few hundred bytes max), so moving from Lucene to another storage engine probably isn't any faster (perhaps even slower), because each document ID would result in a separate call to that other engine. Unless I can make that call and get results at a rate of 3 or more per ms, it's probably slower than Lucene (unfortunately) at this point.
Josh Handel