I am trying to create a Lucene index of around 2 million records. Indexing takes around 9 hours. Could you please suggest how to improve performance?

A: 

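The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. In older Lucene versions, this value tells Lucene how many documents to store in memory before writing them to disk, as well as how often to merge multiple segments together. In later releases, mergeFactor only controls how often segments are merged, and the in-memory buffer is set separately (maxBufferedDocs or the RAM buffer size).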

http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html

Robert Harvey
+1  A: 

I wrote a terrible post on how to parallelize a Lucene Index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).

Anyhow, the main idea is that you chunk up your data into sizable pieces, and then work on each of those pieces on a separate thread. When each of the pieces is done, you merge them all into a single index.
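To make the chunk, index in parallel, then merge idea concrete, here is a minimal sketch using only the Java standard library. It is not the code from the blog post. The class name, the method names, and the tiny in-memory word-to-record-id map are all assumptions made for illustration; the map stands in for a real per-chunk Lucene index so that the example runs without Lucene on the classpath. With Lucene, each task would write its chunk to its own index directory and the final loop would merge those directories into one index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelIndexSketch {

    // Split the records into consecutive chunks of at most chunkSize items.
    static <T> List<List<T>> chunk(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += chunkSize) {
            chunks.add(records.subList(i, Math.min(i + chunkSize, records.size())));
        }
        return chunks;
    }

    // Hypothetical stand-in for "index one chunk": maps each lowercased word
    // to the ids of the records that contain it. A real implementation would
    // add documents to a per-chunk Lucene index here instead.
    static Map<String, List<Integer>> indexChunk(List<String> chunk, int firstId) {
        Map<String, List<Integer>> partial = new TreeMap<>();
        for (int i = 0; i < chunk.size(); i++) {
            int id = firstId + i;
            for (String word : chunk.get(i).toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                List<Integer> ids = partial.computeIfAbsent(word, k -> new ArrayList<>());
                // Record each id at most once per word.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                    ids.add(id);
                }
            }
        }
        return partial;
    }

    // Index every chunk on a worker thread, then merge the partial results
    // in chunk order so record ids stay ascending.
    static Map<String, List<Integer>> buildIndex(List<String> records, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<List<String>> chunks = chunk(records, chunkSize);
            List<Future<Map<String, List<Integer>>>> futures = new ArrayList<>();
            for (int c = 0; c < chunks.size(); c++) {
                final List<String> part = chunks.get(c);
                final int firstId = c * chunkSize;
                Callable<Map<String, List<Integer>>> task = () -> indexChunk(part, firstId);
                futures.add(pool.submit(task));
            }
            Map<String, List<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, List<Integer>>> f : futures) {
                for (Map.Entry<String, List<Integer>> e : f.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("fast index", "slow index", "fast merge");
        System.out.println(buildIndex(records, 2, 2));
    }
}
```

The chunk size and thread count are the knobs to experiment with; the merge step runs on a single thread here, so very small chunks shift more of the work into the merge.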

With the approach described above, I'm able to index 4+ million records in approx. 2 hours.

Hope this gives you an idea of where to go from here.

Esteban Araya
Hi Esteban, thank you for the response. I am looking for something similar to what you have done. Could you please post some code snippets on your blog? Thanks, Gokul
Gokul
+1  A: 

Apart from the writing side (merge factor) and the computation aspect (parallelizing), this is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from data held in a database. Sometimes the query for this data turns out to be too complicated and slow to return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that still takes on the order of 5-9 hours, you've found a place to optimize (the SQL).
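One way to run that check is to drain the record source into a flat file, with no Lucene involved, and time it. The sketch below uses only the Java standard library. The class and method names are assumptions, and the generated "record N" strings are a hypothetical stand-in for looping over the real query's result set.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.IntStream;

class InputTimingSketch {

    // Writes every record from the source to a flat file, one per line,
    // and returns how many records were written.
    static long dumpRecords(Iterator<String> records, Path out) {
        long count = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (records.hasNext()) {
                writer.write(records.next());
                writer.newLine();
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for iterating over the real query results.
        Iterator<String> source =
                IntStream.range(0, 100_000).mapToObj(i -> "record " + i).iterator();

        Path out = Files.createTempFile("records", ".txt");
        long start = System.nanoTime();
        long written = dumpRecords(source, out);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(written + " records read and written in " + elapsedMs + " ms");
        Files.delete(out);
    }
}
```

If this run alone accounts for most of the 9 hours, tuning the IndexWriter or adding threads will not help much until the query itself is faster.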

dlamblin
+1  A: 

Hi Gokul,

The following article really helped me when I needed to speed things up:

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.

Jesse