I'm building a large Lucene index, and each document requires a bit of "putting together" before it can be inserted. I'm reading all of the documents from a database and inserting them into the index. Lucene lets you build several separate indexes and merge them together later, so I've come up with this:
// we'll use a producer/consumer pattern for the job
var documents = new BlockingCollection<Document>();
// we'll have a pool of index writers (each will create its own index)
var indexWriters = new ConcurrentBag<IndexWriter>();
// start filling the collection with documents
Task writerTask = new Task(() =>
{
    foreach (var document in database)
        documents.Add(document);

    documents.CompleteAdding();
}, TaskCreationOptions.LongRunning);
writerTask.Start();
// iterate through the collection, obtaining index writers from the pool and
// creating them when necessary.
Parallel.ForEach(documents.GetConsumingEnumerable(token.Token), document =>
{
    IndexWriter writer;
    if (!indexWriters.TryTake(out writer))
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        var dir = FSDirectory.Open(dirInfo);
        writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // prepare and insert the document into the current index
    WriteDocument(writer, document);

    // put the writer back in the pool
    indexWriters.Add(writer);
});
// now get all of the writers and merge the indexes together...
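The merge step would look roughly like this (just a sketch: mergedPath is a placeholder for the final index location, and depending on the Lucene.NET version the directory-merging method is AddIndexesNoOptimize or AddIndexes):

// Rough sketch of the merge: close each temporary writer so its segments are
// flushed to disk, collect its directory, then merge everything into one
// final index. "mergedPath" is a placeholder.
var directories = new List<Lucene.Net.Store.Directory>();
IndexWriter tempWriter;
while (indexWriters.TryTake(out tempWriter))
{
    directories.Add(tempWriter.Directory); // GetDirectory() in older Lucene.NET versions
    tempWriter.Close();                    // flush the temporary index
}

var mergedDir = FSDirectory.Open(new DirectoryInfo(mergedPath));
var mergedWriter = new IndexWriter(mergedDir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
mergedWriter.AddIndexesNoOptimize(directories.ToArray()); // AddIndexes(...) in newer versions
mergedWriter.Optimize();
mergedWriter.Close();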
My only real concern was that taking an IndexWriter from the pool (and putting it back) on every iteration might be less efficient than simply creating one writer per worker thread up front, but I also know that ConcurrentBag is very efficient and has very low overhead.
Is my solution ok? Or does it scream out for a better solution?
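For comparison, the "one writer per thread" alternative could be expressed with the localInit/localFinally overload of Parallel.ForEach, so the pool is only touched when a worker finishes (again just a sketch, reusing the same helpers as above):

// Sketch of the "one writer per worker" alternative using the
// localInit/localFinally overload of Parallel.ForEach. Each worker task
// creates its own IndexWriter once and hands it back at the end for merging.
Parallel.ForEach(
    documents.GetConsumingEnumerable(token.Token),
    // localInit: runs once for each worker task the loop creates
    () =>
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        return new IndexWriter(FSDirectory.Open(dirInfo), getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    },
    // body: index one document with this worker's writer
    (document, loopState, writer) =>
    {
        WriteDocument(writer, document);
        return writer;
    },
    // localFinally: runs once per worker when it finishes
    writer => indexWriters.Add(writer));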
UPDATE:
After some tests, I think loading from the database is actually a bit slower than the indexing itself. The final index merge is also slow, since it can only use one thread, and I was merging 16 indexes containing around 1.7 million documents in total. Still, I'm open to thoughts on the original question.