Hi all,

We have a general question about best practice/programming during a long index rebuild. This question is not "Solr-specific"; it could just as well apply to raw Lucene or any other similar indexing tool/library/black box.

The question

What is the best practice for ensuring a Solr/Lucene index is "absolutely up to date" after a long index rebuild? I.e., if, during the course of a 12-hour index rebuild, users add/change/delete db records or files (PDFs), how do you ensure the rebuilt index "includes" these changes at the very end?

Context

  • Large database and filesystem (e.g. PDFs) indexed in Solr
  • Multi-core Solr instance, where core0 is for "search" and all adds/changes/deletes, and core1 is for "rebuild". Core1 is a "temporary core".
  • At the end of the rebuild we "move" core1 to core0, so searches and updates go against the freshly rebuilt index (a minimal sketch of the swap follows this list)
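
For reference, the "move" in the last bullet maps onto Solr's CoreAdmin SWAP action (HTTP: /solr/admin/cores?action=SWAP&core=core0&other=core1). A minimal sketch using SolrJ, assuming a SolrJ 6+ style client and the default localhost URL:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;
    import org.apache.solr.common.params.CoreAdminParams;

    public class SwapCores {
        public static void main(String[] args) throws Exception {
            // Admin requests go to the Solr root URL, not to a specific core.
            try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                // Equivalent HTTP call: /solr/admin/cores?action=SWAP&core=core0&other=core1
                CoreAdminRequest swap = new CoreAdminRequest();
                swap.setAction(CoreAdminParams.CoreAdminAction.SWAP);
                swap.setCoreName("core0");      // the live "search" core
                swap.setOtherCoreName("core1"); // the freshly rebuilt core
                swap.process(client);
            }
        }
    }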

Current Approach

  • The rebuild process queries the db and/or traverses the filesystem for "all db records" or "all files"
  • The rebuild will "get" new db records/PDFs if they occur at the end of the query/filesystem traversal. (E.g. the query is "select * from element order by element_id". If we keep the result set open, i.e. stream it rather than build a big list all at once, the result set will include entries added at the end. Similarly, if new files get added "at the end" (a new folder or new file), the file traversal will include them.) A streaming-query sketch follows this list.
  • The rebuild will not "get" changes or deletions to db records/documents which the rebuild process has already processed ("just reindexed")
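
A sketch of the streaming query from the second bullet, assuming plain JDBC, a hypothetical connection URL, and a hypothetical indexRow() helper. One caveat worth checking: whether rows inserted after the query starts are actually visible to an open result set depends on the database's isolation/snapshot semantics, so this behavior should be verified against the specific database in use:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamingRebuild {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection URL; substitute your own driver/db/credentials.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "pass")) {
                // Some drivers (e.g. PostgreSQL) only stream with autocommit off
                // and a positive fetch size.
                conn.setAutoCommit(false);
                try (PreparedStatement stmt =
                        conn.prepareStatement("select * from element order by element_id")) {
                    stmt.setFetchSize(500); // stream rows in batches instead of materializing the whole result set
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            indexRow(rs); // hypothetical helper: map the row to a Solr document and add it
                        }
                    }
                }
            }
        }

        static void indexRow(ResultSet rs) throws Exception {
            // ... build a SolrInputDocument from the row and send it to the rebuild core ...
        }
    }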

Proposed approach

  • Track in the Solr client application (e.g. via a db table) all adds/changes/deletes that occur to the db/filesystem during the rebuild
  • At the end of the rebuild (but before swapping the cores), process these changes: i.e. delete from the index all deleted records/PDFs, and reindex all updates and additions (a replay sketch follows this list)
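
A sketch of that replay step, assuming a hypothetical change_log(element_id, change_type, changed_at) table populated by the client, and the rebuild core reachable via a SolrJ SolrClient (deleteById/add/commit are standard SolrJ calls):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReplayChanges {
        // Replays all changes logged since the rebuild started against the
        // rebuild core (core1), before the cores are swapped.
        public static void replay(Connection conn, SolrClient rebuildCore,
                                  Timestamp rebuildStart) throws Exception {
            String sql = "select element_id, change_type from change_log where changed_at >= ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setTimestamp(1, rebuildStart);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        String id = rs.getString("element_id");
                        if ("DELETE".equals(rs.getString("change_type"))) {
                            rebuildCore.deleteById(id);              // purge deleted records/PDFs from the new index
                        } else {
                            rebuildCore.add(loadDocument(conn, id)); // re-fetch and reindex adds/updates
                        }
                    }
                }
            }
            rebuildCore.commit(); // make the replayed changes visible before swapping cores
        }

        // Hypothetical helper: load the current row (or extract the PDF) and map it to a document.
        static SolrInputDocument loadDocument(Connection conn, String id) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            // ... add the remaining fields from the db row / extracted file text ...
            return doc;
        }
    }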

Follow on

  • Is there a better approach?
  • Does Solr have any magic means to "meld" core0 into core1?

Thanks