I want to use Lucene.NET for full-text search shared between two apps: one is an ASP.NET MVC application and the other is a console application. Both applications are supposed to search and update the index. How should the concurrency be handled?
I found a tutorial on ifdefined.com where a similar use case is discussed. My concern is that locking will be a big bottleneck.

PS: I also noticed that IndexSearcher works against a snapshot of the index, and in the tutorial mentioned above the searcher is re-created only when the index is updated. Is this a good approach? Can I just create a new searcher object for each search, and if so, what is the overhead?

I found a related question http://stackoverflow.com/questions/193624/does-lucene-net-manage-multiple-threads-accessing-the-same-index-one-indexing-wh which claims that interprocess concurrency is safe. Does that mean there are no race conditions on the index?

Also, one very important aspect: what is the performance hit if, say, 10-15 threads are trying to update the Lucene index by acquiring the shared lock presented in that solution? Thanks.

+3  A: 

I also have a Lucene search index that's used by multiple clients. I solve this issue by making the 'Lucene Search Service' a separate web service running in its own AppDomain. Since both clients hit the same web service to search or update the index, I can make it thread-safe with locks around Lucene's indexers.

Other than that, if you want to keep it in-process, I suggest using file locks to make sure only one client writes to the index at a time.
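A minimal sketch of that lock-file idea, assuming every writing process goes through the same helper (the file name, path, and retry policy below are made up for illustration):

static FileStream AcquireIndexLock(string indexPath, TimeSpan timeout)
{
    // FileMode.CreateNew fails if the file already exists, so only one process
    // can hold the lock at a time; DeleteOnClose removes it when we dispose.
    var lockFile = Path.Combine(indexPath, "myapp.write.lock");
    var deadline = DateTime.UtcNow + timeout;
    while (true)
    {
        try
        {
            return new FileStream(lockFile, FileMode.CreateNew, FileAccess.Write,
                                  FileShare.None, 4096, FileOptions.DeleteOnClose);
        }
        catch (IOException)
        {
            if (DateTime.UtcNow > deadline)
                throw new TimeoutException("Timed out waiting for the index lock.");
            Thread.Sleep(250); // another process is writing; wait and retry
        }
    }
}

// usage: the lock is released when the stream is disposed
using (AcquireIndexLock(@"\\server\share\index", TimeSpan.FromSeconds(30)))
{
    // open IndexWriter, make changes, close it
}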

To get it to use a new index, I build one on the side and then tell the Search Index service to swap over to it by safely disposing of any indexers on the current index and renaming directories, e.g. (a rough sketch follows the list):

  • Index.Current > Index.Old
  • Index.New > Index.Current
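Roughly, the swap might look like this (a sketch only: the paths and the searcher/writer fields are placeholders, and the IndexSearcher/FSDirectory overloads shown are from the 2.9-era Lucene.NET API):

// dispose everything that holds files open on the current index
searcher.Close();   // the IndexSearcher the service normally searches with
writer.Close();     // any open IndexWriter, so write.lock and segment files are released

if (System.IO.Directory.Exists(@"D:\Index.Old"))
    System.IO.Directory.Delete(@"D:\Index.Old", true);

System.IO.Directory.Move(@"D:\Index.Current", @"D:\Index.Old");   // Index.Current -> Index.Old
System.IO.Directory.Move(@"D:\Index.New", @"D:\Index.Current");   // Index.New     -> Index.Current

// reopen the searcher on the swapped-in index
searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(@"D:\Index.Current")), true);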
mythz
Could you be clearer about the file locks? So you rebuild a new index, switch over to it, and after that delete the old one? Thanks.
Jenea
Yeah, I just meant creating an empty file called something like 'write.lock' on the file system to indicate you are writing to the index; when you are finished writing you remove it. Then you just have to make sure that only the process that created the lock can read/write the index.
mythz
+8  A: 

First of all, we have to define a "write" operation. A write operation obtains a lock when it starts and holds it until you close the object that is performing the work. For example, creating an IndexWriter and indexing a document obtains the write lock, and the lock is kept until you close the IndexWriter.

Now we can talk about the lock a little bit. The lock that is obtained is file based: as mythz mentioned earlier, a file called 'write.lock' is created in the index directory. Once the write lock is obtained it is exclusive! It causes all other index-modifying operations (IndexWriter, and some methods on IndexReader) to wait until the lock is released.

Overall, you can have multiple readers on an index. You can even read and write at the same time, no problem. The problem is having multiple writers: if one of them waits too long for the lock, it will time out.
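To make that concrete, this is roughly what the write lock does (the path and analyzer are placeholders, and the constructors shown assume a 2.9-era Lucene.NET; older versions differ slightly):

var dir = FSDirectory.Open(new DirectoryInfo(@"C:\MyIndex"));
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

// The first writer creates 'write.lock' in the index directory and holds it until Close().
var writer1 = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

try
{
    // A second writer (even in another process) waits for the lock and eventually gives up.
    var writer2 = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
}
catch (LockObtainFailedException)
{
    // thrown when the write lock cannot be obtained within the timeout
}

// Readers (IndexSearcher/IndexReader) were never blocked by any of this.
writer1.Close();   // releases write.lock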

1) Direct operations

If you are sure that your indexing operations are short and quick, you may be able to just let both applications use the same index directly. Otherwise you will have to think about how to organize the two applications' indexing operations.

2) Web service

Since you are already working with a web solution, you could create a web service for indexing. When implementing it, I would dedicate a worker thread to indexing and feed it from a work queue; if the queue contains multiple jobs, the worker should grab them all and process them as a batch. This solves the multiple-writer problem entirely.
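A rough sketch of that worker-thread-plus-queue idea (BlockingCollection requires .NET 4, and 'dir', 'analyzer' and 'someDocument' are placeholders you would already have):

var workQueue = new BlockingCollection<Document>();

var indexingThread = new Thread(() =>
{
    // only this thread ever opens an IndexWriter, so there is no lock contention
    var writer = new IndexWriter(dir, analyzer, false); // false = open an existing index
    foreach (var doc in workQueue.GetConsumingEnumerable())
    {
        writer.AddDocument(doc);

        // drain anything else that is already queued into the same batch
        Document next;
        while (workQueue.TryTake(out next))
            writer.AddDocument(next);

        writer.Commit();   // one commit per batch keeps lock time and I/O down
    }
    writer.Close();
}) { IsBackground = true };
indexingThread.Start();

// the web service's index method just enqueues and returns immediately
workQueue.Add(someDocument);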

3) Create another index, then merge

If the console application does heavy work on the index, you could have it write to a separate index of its own and then merge the two indexes at some safe, scheduled time using IndexWriter.AddIndexes.
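The merge step itself is small; something along these lines (paths and analyzer are placeholders, and depending on your Lucene.NET version you may use AddIndexes or AddIndexesNoOptimize):

var mainDir = FSDirectory.Open(new DirectoryInfo(@"D:\Index.Main"));
var consoleDir = FSDirectory.Open(new DirectoryInfo(@"D:\Index.Console"));
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

var writer = new IndexWriter(mainDir, analyzer, false);                    // open the live index for writing
writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { consoleDir }); // pull the console app's segments in
writer.Commit();
writer.Close();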

From here you can do this in two ways: merge directly into the live index, or merge into a third index and, when it is ready, swap it in for the original. You have to be careful here as well to make sure you are not locking an index that is in heavy use and causing timeouts for other write operations.

4) Index & search multiple indexes

Personally, I think people should separate their indexes. This separates the responsibilities of the programs and minimizes the downtime and maintenance that come with having a single index as the point of contact for everything. For example, if your console application is only responsible for adding certain fields, or is in effect extending an index, you could split the indexes apart but maintain identity by putting an ID field in each document. You can then take advantage of the built-in support for searching multiple indexes using the MultiSearcher class, or, if you want, the ParallelMultiSearcher class, which searches the indexes concurrently.
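For example, something like this (paths are placeholders; the readOnly flag and FSDirectory.Open shown here are the 2.9-era API):

// one IndexSearcher per physical index
var webSearcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(@"D:\Index.Web")), true);
var consoleSearcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(@"D:\Index.Console")), true);

// combine them behind one searcher; ParallelMultiSearcher queries the indexes concurrently
var searcher = new MultiSearcher(new Searchable[] { webSearcher, consoleSearcher });
// var searcher = new ParallelMultiSearcher(new Searchable[] { webSearcher, consoleSearcher });

var hits = searcher.Search(query, 20); // 'query' built with a QueryParser as usual; returns TopDocs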

5) Look into SOLR

Something else that can help with keeping a single place for your index: you could change your programs to work against a Solr server (http://lucene.apache.org/solr/). There is also a nice SolrNet library (http://code.google.com/p/solrnet/) that can be helpful in this situation. I'm not experienced with Solr, but I am under the impression that it will help you manage situations such as this. It also has other benefits, such as hit highlighting, finding related items via "MoreLikeThis", and spell checking.

I'm sure there are other methods, but these are the ones I can think of. Overall, your solution depends on how many writers you have and how up to date the search index needs to be. In any of these situations, deferring some operations to a later time and batching them will give you the most performance. My suggestion is to understand what you're able to work with and go from there. Good luck!

Andrew Smith
Wow, thanks. I was thinking of a solution somewhat along the lines of #2. In the meantime I have another question: how many indexes can ParallelMultiSearcher or MultiSearcher support?
Jenea
+2  A: 

If you will have multiple writers in different processes, and they will spend more than 10 seconds writing their changes to the index (which would cause waiting writers to time out), then you can synchronize access across processes by using named Mutexes. Simply open/create a Mutex with the same global name in each application, call Mutex.WaitOne before writing, and Mutex.ReleaseMutex after writing.

var mut = new Mutex(false, "myUniqueMutexName"); // opens the named mutex, creating it if it doesn't exist
mut.WaitOne();                                   // block until no other process holds it
try {
  // write logic
}
finally {
  // always release, even if the write failed
  mut.ReleaseMutex();
}

It's probably better to make the Mutex a singleton, since they're a little expensive to construct.
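For instance (the name is a placeholder; the "Global\" prefix makes the mutex visible across sessions on the same machine):

// one named mutex per process, created once and reused for every write
static readonly Mutex IndexWriteMutex = new Mutex(false, @"Global\MyApp.LuceneIndexWrite");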

Update (per comment):

If the processes are on separate machines, I think your only alternative is to layer your own filesystem locking (using old-fashioned lock files) to synchronize access. Since the built-in locking uses filesystem locks anyway, I would actually recommend you just increase the IndexWriter write-lock timeout every time you construct one.

// 'dir' and 'analyzer' are whatever you already use to open the index
var iw = new IndexWriter(dir, analyzer, false); // false = open the existing index
iw.SetWriteLockTimeout(60000);                  // wait up to 60 seconds for write.lock instead of the default

You can also just keep trying a specified number of times.

var committed = false;
var attempts = 0;
while (!committed && attempts < 10) {
  try {
    // write logic
    committed = true;
  } catch (LockObtainFailedException) { // Lucene.Net.Store: another writer still holds write.lock
    attempts++;
    Thread.Sleep(1000); // back off briefly before retrying
  }
}
gWiz
Thank you for your solution. It would be a good one, but because of the infrastructure it cannot be applied: the processes are running on different machines and access the Lucene index in a shared network folder, so a mutex won't be able to block those processes.
Jenea
My bad though, I didn't specify this in the question. I'm sorry.
Jenea
I've updated my answer in response to your comments.
gWiz