Hi,

we are designing the search architecture for a corporate web application, using Lucene.net. The indexes will not be big (about 100,000 documents), but the search service must always be up and always be up to date: new documents are added to the index all the time, alongside concurrent searches. Since the search system must be highly available, we have 2 application servers, each running a copy of a WCF service that performs searches and indexing. The service then uses the Lucene.net API to access the indexes.

The problem is, what would be the best solution to keep the indexes synced all the time? We have considered several options:

  • Using one server for indexing and having the 2nd server access the indexes via SMB: no can do, because the indexing server becomes a single point of failure;

  • Indexing on both servers, essentially writing every index twice: probably lousy performance, and a possibility of desync if, e.g., server 1 indexes OK but server 2 runs out of disk space or whatever;

  • Using Solr or Katta to wrap access to the indexes: nope, we cannot have Tomcat or similar running on the servers; we only have IIS.

  • Storing the index in a database: I found this can be done with the Java version of Lucene (the JdbcDirectory module), but I couldn't find anything similar for Lucene.net. Even if it meant a small performance hit, we'd go for this option, because it would cleanly solve the concurrency and syncing problems with minimum development.

  • Using the Lucene.net DistributedSearch contrib module: I couldn't find a single link with documentation about it. Looking at the code, it seems to actually split the index across multiple machines, which is not what we want.

  • rsync and friends, copying the indexes back and forth between the 2 servers: this feels hackish and error-prone to us. If the indexes grow big, the copy might take a while, and during that period we would be returning either corrupt or inconsistent data to clients, so we'd have to develop some ad hoc locking policy, which we'd rather avoid.

I understand this is a complex problem, but I'm sure lots of people have faced it before. Any help is welcome!

+3  A: 

It seems that the best solution would be to index the documents on both servers into their own copy of the index.

If you are worried about the indexing succeeding on one server and failing on the other, then you'll need to keep track of the success/failure for each server so that you can retry the failed documents once the problem is resolved. This tracking would be done outside of Lucene, in whatever system you are using to present the documents to Lucene for indexing. Depending on how critical the completeness of the index is to you, you may also have to remove the failed server from whatever load balancer you are using until the problem has been fixed and indexing has reprocessed any outstanding documents.
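To make the tracking idea concrete, here is a rough sketch in Python (for brevity; the real service would be .NET/WCF). The names `PendingStore` and `index_on` are made up for illustration — `index_on` stands in for the call to each server's indexing endpoint:

```python
class PendingStore:
    """Tracks documents that failed to index on a given server."""
    def __init__(self):
        self.failed = {}  # server -> set of doc ids awaiting retry

    def mark_failed(self, server, doc_id):
        self.failed.setdefault(server, set()).add(doc_id)

    def mark_done(self, server, doc_id):
        self.failed.get(server, set()).discard(doc_id)

def index_everywhere(doc_id, servers, index_on, store):
    """Try to index doc_id on every server; record failures for retry."""
    for server in servers:
        try:
            index_on(server, doc_id)
            store.mark_done(server, doc_id)
        except Exception:
            store.mark_failed(server, doc_id)

def retry_pending(index_on, store):
    """Re-run failed documents once a server is healthy again."""
    for server, docs in list(store.failed.items()):
        for doc_id in list(docs):
            try:
                index_on(server, doc_id)
                store.mark_done(server, doc_id)
            except Exception:
                pass  # server still unhealthy; keep the doc queued
```

One server failing (say, out of disk space) then doesn't block the other; its backlog is simply replayed later.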

Sean Carpenter
Sean, this is currently our candidate option. I agree with you and itsadok that it seems the sanest choice. I'm also trying to find the source for JdbcDirectory to see whether a port to .NET + SQL Server would be feasible. I'll keep the question open for a while to see if new approaches come up, and will accept this answer otherwise.
axel_c
I looked into the same thing once. It didn't seem worth the effort, as there is a bunch of DB-transaction-related code that is not trivial to port to .NET. There were also complaints of reduced speed using JdbcDirectory. The source is in the Compass project: http://svn.compass-project.org/svn/compass/trunk/src/main/src/org/apache/lucene/store/jdbc/
Sean Carpenter
After some thinking, this is what I see as the most viable solution: when an indexing/deindexing request is received, insert a row in a shared DB table that works as a queue. Implement a simple Win32 service that runs on both app servers and polls the queue every X seconds, indexing the content locally. When the content is successfully indexed, the service marks the item as processed for that server; otherwise it keeps retrying.
axel_c
+1  A: 

+1 for Sean Carpenter's answer. Indexing on both servers seems like the sanest and safest choice.

If the documents you're indexing are complex (Word/PDF and the like), you could perform the preprocessing on a single server and then hand the result to the indexing servers, to save some processing time.

A solution I've used before involves creating an index chunk on one server, then rsyncing it over to the search servers and merging the chunk into each index, using IndexWriter.AddIndexesNoOptimize. You can create a new chunk every 5 minutes or whenever it gets to a certain size. If you don't have to have absolutely up-to-date indexes, this might be a solution for you.
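A rough sketch of that chunk-then-merge flow, with the Lucene-specific steps (building the chunk index, rsyncing it, and calling IndexWriter.AddIndexesNoOptimize on each server) stubbed out as a single callable; the thresholds are illustrative:

```python
import time

class ChunkShipper:
    """Accumulates documents into a chunk, then ships the chunk to the
    search servers when it reaches a size or age threshold."""

    def __init__(self, merge_on_servers, max_docs=1000, max_age_secs=300):
        # merge_on_servers stands in for: rsync chunk + AddIndexesNoOptimize
        self.merge_on_servers = merge_on_servers
        self.max_docs = max_docs
        self.max_age_secs = max_age_secs
        self.chunk = []
        self.chunk_started = time.monotonic()

    def add(self, doc):
        self.chunk.append(doc)
        if (len(self.chunk) >= self.max_docs or
                time.monotonic() - self.chunk_started >= self.max_age_secs):
            self.flush()

    def flush(self):
        if self.chunk:
            self.merge_on_servers(self.chunk)  # ship and merge into each index
            self.chunk = []
        self.chunk_started = time.monotonic()
```

The trade-off is exactly the one noted above: documents only become searchable when their chunk is flushed, so the thresholds bound how stale the indexes can be.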

itsadok
A: 

In the Java world, we solved this problem by putting an MQ in front of the index(es). The insert was only considered complete when the bean that pulled the message from the queue succeeded; otherwise it rolled back any action it had taken, marked the document as pending, and the document was retried later.

Aaron Saunders