We run full re-indexes of our Lucene index every 7 days (i.e. we create the index from scratch), and incremental indexes every 2 hours or so. Our index contains around 700,000 documents, and a full index takes around 17 hours (which isn't a problem).

When we do incremental indexes, we only index content that has changed in the past two hours, so it takes much less time - around half an hour. However, we've noticed that a lot of this time (maybe 10 minutes) is spent running the IndexWriter.optimize() method.

The LuceneFAQ mentions that:

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

...but this doesn't seem to give any definition of what "frequently" means. Optimizing is CPU-intensive and VERY IO-intensive, so we'd rather not do it if we can get away without it. How big is the hit of running queries on an un-optimized index (I'm thinking especially in terms of query performance after a full re-index compared to after 20 incremental indexes in which, say, 50,000 documents have changed)? Should we be optimizing after every incremental index, or is the performance hit not worth it?
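
For reference, the incremental pass boils down to something like the following (a simplified sketch against the Lucene 2.x-era API; Content, changedSince() and toLucene() are illustrative stand-ins for our real data-access code, not actual helpers):

    import java.io.IOException;
    import java.util.Date;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    // Sketch of one incremental pass (lives inside our indexer class).
    void incrementalIndex(Date since, boolean optimize) throws IOException {
        // Open the existing index for appending (create == false).
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),
                new StandardAnalyzer(),
                false);
        try {
            for (Content c : changedSince(since)) {
                // updateDocument() deletes any previous document with the
                // same id term and adds the new version in one call.
                writer.updateDocument(new Term("id", c.getId()), toLucene(c));
            }
            if (optimize) {
                writer.optimize(); // the ~10 minute step this question is about
            }
        } finally {
            writer.close();
        }
    }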

+2  A: 

Mat, since you seem to have a good idea how long your current process takes, I suggest that you remove the optimize() and measure the impact.

Do many of the documents change in those 2-hour windows? If only a small fraction (50,000/700,000 is about 7%) is incrementally re-indexed, then I don't think you are getting much value out of an optimize().

Some ideas:

  • Don't do an incremental optimize() at all. My experience says you are not seeing a huge query improvement anyway.
  • Do the optimize() daily instead of 2-hourly.
  • Do the optimize() during low-volume times (which is what the javadoc says).

And make sure you take measurements. These kinds of changes can be a shot in the dark without them.
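
Even something as blunt as this is enough to isolate the optimize() cost from the rest of the pass (a sketch; runIncrementalIndex() stands in for whatever currently drives your updates):

    // Crude A/B timing: run once with and once without the optimize(),
    // then compare these numbers and your query latencies afterwards.
    long t0 = System.currentTimeMillis();
    runIncrementalIndex(writer);   // your existing update logic
    long t1 = System.currentTimeMillis();
    writer.optimize();             // comment this line out for the second run
    long t2 = System.currentTimeMillis();
    System.out.println("indexing: " + (t1 - t0) + " ms, optimize: " + (t2 - t1) + " ms");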

Matt Quail
These kinds of changes *are* shots in the dark without them.
David Schmitt
Cheers, guess I was wondering whether people had experience of this before I dived in and started messing with a production system :)
Mat Mannion
Mat: yes, I realize you were looking for specific advice, and I was being a little general. In my experience (I've been using Lucene for years) you will be fine without the optimize(). I've outright removed the optimize() from one of our systems because of its overhead.
Matt Quail
A: 

Hey, we have several indexes with around 250,000 documents each. At first I tried to use a live index: every time an update or insert hit our database tables, I did a deleteIndex and an insertIndex against Lucene. But it seems that many concurrent deletes/inserts corrupt the index - I often get an error message saying that an index is not readable. Have you seen this error message before?

I now do it the same way as you, Mat: I store all updated/inserted/deleted rows in an extra table, which a cron job reads every 30 minutes. The problem seems to be solved now. Isn't there a solution for a real live search, though?! I'm using Symfony and the sfLucenePlugin...
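
For what it's worth, stripped of the framework the cron job is only a few lines against Lucene's own Java API. Roughly like this (a sketch: the index_queue table, its columns, jdbcUrl and loadDocument() are all made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    // Drain the change-queue table and apply it to the index in one batch.
    // Serialising all writes through a single IndexWriter is what avoids
    // the concurrent-write trouble of the "live" approach.
    Connection conn = DriverManager.getConnection(jdbcUrl);
    IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),
            new StandardAnalyzer(), false);
    try {
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, action FROM index_queue");
        while (rs.next()) {
            Term idTerm = new Term("id", rs.getString("id"));
            if ("delete".equals(rs.getString("action"))) {
                writer.deleteDocuments(idTerm);
            } else {
                // inserts and updates look the same: replace by id term
                writer.updateDocument(idTerm, loadDocument(rs.getString("id")));
            }
        }
        st.executeUpdate("DELETE FROM index_queue"); // clear the processed queue
    } finally {
        writer.close();
        conn.close();
    }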

I appreciate all suggestions. Thanks...

+1  A: 

An optimize operation reads and writes the entire index, which is why it's so IO-intensive!

The idea behind an optimize operation is to re-combine all the various segments in the Lucene index into one single segment, which can greatly reduce query times, as you no longer have to open and search several files per query. If you're using the normal Lucene index file structure (rather than the compound structure), you get a new segment per commit operation - the same as your incremental re-indexes, I assume?

I think Matt has great advice and I'd second everything he says - be driven by the data you have. I would actually go a step further and only optimize (a) when you need to and (b) when you have low query volume.

As query performance is intimately tied to the number of segments in your index, something as simple as ls -1 index/segments_* | wc -l could be a useful indicator of when an optimization is really needed.

Alternatively, tracking query performance and volume, and kicking off an optimize when you hit unacceptably low performance during acceptably low volume, would be a nicer solution.
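
If you'd rather do that check from inside the indexer than from a shell, counting segment files needs nothing beyond the JDK (a heuristic sketch, not an official Lucene API; MAX_SEGMENTS is a threshold you'd pick from your own measurements, and writer is your open IndexWriter):

    import java.io.File;
    import java.io.FilenameFilter;

    // One .cfs file per segment in the compound format; with the normal
    // (non-compound) format, count .fnm files instead - one per segment.
    File[] segmentFiles = new File("/path/to/index").listFiles(
            new FilenameFilter() {
                public boolean accept(File dir, String name) {
                    return name.endsWith(".cfs");
                }
            });
    if (segmentFiles != null && segmentFiles.length > MAX_SEGMENTS) {
        writer.optimize(); // only pay the IO cost when fragmentation is real
    }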

Alabaster Codify
A: 

In this mail, Otis Gospodnetic advises against using optimize if your index is seeing constant updates. It's from 2007, but calling optimize() is by its very nature an IO-heavy operation. You could consider using a more stepwise approach: a MergeScheduler.
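
Wiring that in is just configuration on the writer (a sketch against the Lucene 2.3-era API; the merge factor is a starting point to tune against your own measurements, not a recommendation):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    // Let background merges keep the segment count in check instead of
    // calling optimize() explicitly after every incremental pass.
    IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),
            new StandardAnalyzer(), false);
    writer.setMergeScheduler(new ConcurrentMergeScheduler()); // merges run in background threads
    writer.setMergeFactor(10); // lower = fewer segments per level, but more merge IO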

Steen