views:

102

answers:

1

Hi to all!

I am implementing a search engine with Solr that imports at least 2 million documents per day. Users must be able to search the imported documents as soon as possible (near real-time).

I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr in shard mode). Each server indexes about 120 million documents, about 220 GB (500 GB in total).

I want to take incremental backups of the Solr index files while updates or searches are running.
After searching, I found the rsync tool for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get a "vanished" error during updates.

How can I solve this problem?

Note 1: Plain file copy is really slow when the files are very large, so I can't use that approach.
Note 2: Can I prevent index file corruption during an update if Windows crashes, the hardware resets, or some other problem occurs?

Thanks in advance, Hamid

+4  A: 

Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.

Some ideas to work around it:

  • Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
  • Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
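As a rough illustration of the second idea, here is a minimal Python sketch of the "pause replication, back up, resume" sequence on the passive core. It assumes the passive core exposes Solr's ReplicationHandler at `/replication` and uses its `disablepoll`, `backup` and `enablepoll` commands. The host, port and core name in `PASSIVE_CORE` and the `wait_for_snapshot` hook are hypothetical placeholders, not part of the original setup.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical location of the passive core; adjust host, port and core name.
PASSIVE_CORE = "http://localhost:8080/solr/passive"


def replication_url(core_url, command, **params):
    """Build a ReplicationHandler URL such as .../replication?command=backup."""
    query = urlencode({"command": command, **params})
    return f"{core_url.rstrip('/')}/replication?{query}"


def backup_steps(core_url):
    """URLs to call, in order: stop polling, take a snapshot, resume polling."""
    return [
        replication_url(core_url, "disablepoll"),
        replication_url(core_url, "backup"),
        replication_url(core_url, "enablepoll"),
    ]


def run_backup(core_url, wait_for_snapshot=None, timeout=30):
    """Call each step in order; polling is re-enabled even if a step fails.

    The backup request returns before the snapshot is finished, so the
    caller can pass wait_for_snapshot (a hypothetical hook, e.g. one that
    polls command=details) to block until it is safe to resume polling.
    """
    disable, backup, enable = backup_steps(core_url)
    urlopen(disable, timeout=timeout).close()
    try:
        urlopen(backup, timeout=timeout).close()
        if wait_for_snapshot is not None:
            wait_for_snapshot()
    finally:
        urlopen(enable, timeout=timeout).close()
```

Calling `run_backup(PASSIVE_CORE)` from a scheduled task would leave a snapshot directory next to the passive index, which rsync or DeltaCopy could then copy without files vanishing mid-transfer.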
Mauricio Scheffer
Isn't the passive index already the perfect backup?
Karussell
@Karussell: it's just a copy and not a proper backup by itself since you can't apply backup policies like off-site storage, incremental/differential/full backup, etc. There's a lot more to backup than just copying stuff.
Mauricio Scheffer
thanks a lot Mauricio
Hamid
@Mauricio: I am not very familiar with backup stuff. But what is off-site storage? (You could place the passive index on another server)
Karussell
@Karussell: off-site storage: placing copies of the backup in other buildings/cities/states/countries. The passive index should be as close as possible to the main index to make replication fast. The backup should also be done close to the passive index to keep replication disabled for as short a time as possible. Only once you have that backup can you choose to store it off-site.
Mauricio Scheffer