I'm building a large Lucene index, and each document requires a little bit of "putting together" before it can be inserted. I'm reading all of the documents from a database and inserting them into the index. Lucene allows you to build several separate indexes and merge them together later, so I've come up with this:

// we'll use a producer/consumer pattern for the job
var documents = new BlockingCollection<Document>();

// we'll have a pool of index writers (each will create its own index)
var indexWriters = new ConcurrentBag<IndexWriter>();

// start filling the collection with documents
Task writerTask = new Task(() => {
    foreach (var document in database)
        documents.Add(document);
    documents.CompleteAdding();
}, TaskCreationOptions.LongRunning);
writerTask.Start();

// iterate through the collection, obtaining index writers from the pool and
// creating them when necessary.
Parallel.ForEach(documents.GetConsumingEnumerable(), document =>
{
    IndexWriter writer;
    if(!indexWriters.TryTake(out writer))
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        var dir = FSDirectory.Open(dirInfo);
        writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }
    // prepare and insert the document into the current index
    WriteDocument(writer, document);
    indexWriters.Add(writer); // put the writer back in the pool
});

// now get all of the writers and merge the indexes together...
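For that last step, here's a minimal sketch of what the merge could look like. It assumes a Lucene.NET build whose IndexWriter exposes GetDirectory, Close, AddIndexesNoOptimize and Optimize (older versions use AddIndexes instead), and it reuses _indexPath and getAnalyzer from above (System.Linq is needed for Select):

// close each partial writer, keeping hold of its directory
var partialDirs = indexWriters
    .Select(w => { var dir = w.GetDirectory(); w.Close(); return dir; })
    .ToArray();

// merge every partial index into one final index at _indexPath
var mergeWriter = new IndexWriter(FSDirectory.Open(new DirectoryInfo(_indexPath)),
    getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
mergeWriter.AddIndexesNoOptimize(partialDirs);
mergeWriter.Optimize();
mergeWriter.Close();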

My only concern, which gave me pause, was that pulling an IndexWriter from the pool (and putting it back at the end) on every iteration might be less efficient than just creating the optimal number of threads to start with, each with its own writer (a sketch of that variant is below), but I also know that ConcurrentBag is very efficient and has extremely low processing overhead.
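For reference, that "one writer per thread" idea roughly corresponds to the localInit/localFinally overload of Parallel.ForEach, where each worker builds its own writer once and keeps it for the life of the loop instead of checking one in and out per document. A sketch, reusing the same helpers as above:

// each worker creates its own IndexWriter up front (localInit), reuses it for
// every document it processes, and hands it over for merging at the end (localFinally)
var allWriters = new ConcurrentBag<IndexWriter>();

Parallel.ForEach(
    documents.GetConsumingEnumerable(),
    () =>
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        return new IndexWriter(FSDirectory.Open(dirInfo), getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    },
    (document, loopState, writer) =>
    {
        WriteDocument(writer, document);
        return writer;
    },
    writer => allWriters.Add(writer));

Note that localInit can run more than once per worker over the life of the loop, so this can still produce more partial indexes than cores.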

Is my solution ok? Or does it scream out for a better solution?

UPDATE:

After some tests, I think loading from the database is actually a bit slower than the indexing itself. The final index merge is also slow because it can only use one thread, and I was merging 16 indexes containing around 1.7 million documents in total. Still, I'm open to thoughts on the original question.

+1  A: 

One issue I have seen with Parallel.ForEach is that it can decide to add threads beyond the normal one per core when CPU utilization is low. This makes sense for tasks that are waiting on a remote server to respond, but for a slow, disk-intensive process it can sometimes lead to poor performance because the disk ends up thrashing.

If your processing is disk-bound rather than CPU-bound, you might want to try passing a ParallelOptions to your Parallel.ForEach and setting MaxDegreeOfParallelism to ensure it's not thrashing the disk needlessly.
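As a sketch, that could look like this (the right cap has to be found by experiment; Environment.ProcessorCount is just a starting point):

// cap the number of concurrent workers so the disk isn't thrashed
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

Parallel.ForEach(documents.GetConsumingEnumerable(), options, document =>
{
    // ... same body as in the question ...
});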

Hightechrider
The question is: how do I decide what value to use for MaxDegreeOfParallelism?
Nathan Ridley
Sadly, as with many optimization tasks, that's almost certainly an answer that can only be found by experimentation. Start at one per CPU core and increase it until throughput stops improving.
Hightechrider