views: 132

answers: 2

Hi All,

My colleague and I are trying to implement a mechanism to recover from corrupted files on an embedded device.

This can happen under certain circumstances, e.g. the user removes the battery while a file is being written.

Orz, but so far we have only one idea:

  • Create duplicate backup files, and copy them back if a risky file I/O operation does not finish properly.

This is kind of stupid, because if the backup files are also corrupted, we are just dead.

Do you have any suggestions or good articles on this?

Thanks in advance.

+1  A: 

Read up on database logging and database journal files.

A database (like Oracle) has very, very robust file writing. Don't actually use Oracle; borrow its design pattern for reliability. The pattern goes something like this.

  1. Your transaction (e.g., an insert) fetches the block to be updated. Usually this block is already in the memory cache; if not, it is read from disk into the cache.

  2. A "before image" (or rollback segment) copy is made of the block you're about to write.

  3. You change the cache copy, write a journal entry, and queue up a DB write.

  4. You commit the change, which makes the cache change visible to other transactions.

  5. At some point, the DB writer will finalize the DB file change.

The journal is a simple circular queue file -- the records are just a history of changes with little structure to them. It can be replicated on multiple devices.
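To make the pattern concrete, here is a rough sketch in C of what a journal append could look like. The record layout, the checksum32() helper and the sizes are purely illustrative assumptions, not Oracle's (or anyone's) actual on-disk format.

```c
/* Sketch of a circular, checksummed journal; all names and sizes are
 * illustrative, not any real database's format. */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define JOURNAL_SLOTS 64

struct journal_rec {
    uint32_t txn_no;      /* sequential transaction number */
    uint32_t block_no;    /* which block of the main file is changing */
    uint8_t  data[128];   /* new contents of that block */
    uint32_t crc;         /* checksum over the fields above */
};

/* hypothetical helper; any CRC32 implementation would do */
extern uint32_t checksum32(const void *buf, size_t len);

/* Append one record to the circular journal and flush it to stable
 * storage *before* the change is queued against the main file. */
int journal_append(int jfd, const struct journal_rec *in)
{
    struct journal_rec rec = *in;
    rec.crc = checksum32(&rec, offsetof(struct journal_rec, crc));

    off_t slot = (off_t)(rec.txn_no % JOURNAL_SLOTS) * sizeof rec;
    if (pwrite(jfd, &rec, sizeof rec, slot) != (ssize_t)sizeof rec)
        return -1;
    return fsync(jfd);    /* only now may the DB write be queued */
}
```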

The DB files are more complex structures. They have a "transaction number" -- a simple sequential count of overall transactions. This is encoded in the block (two different ways) as well as written to the control file.

A good DBA assures that the control file is replicated across devices.

When Oracle starts up, it checks the control file(s) to find which one is likely to be correct. Others may be corrupted. Oracle checks the DB files to see which match the control file. It checks the journal to see if transactions need to be applied to get the files up to the correct transaction number.

Of course, if it crashes while writing all of the journal copies, that transaction will be lost -- not much can be done about that. However, if it crashes after the journal entry is written, it will probably recover cleanly with no problems.

If you lose media, and recover a backup, there's a chance that the journal file can be applied to the recovered backup file and bring it up to date. Otherwise, old journal files have to be replayed to get it up to date.
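Recovery at startup can then be a simple scan of the journal. Again just a sketch: it reuses the journal_rec layout, JOURNAL_SLOTS and checksum32() from the snippet above, assumes a hypothetical apply_block() that rewrites one block of the main data file, and assumes main_txn_no comes from the control file / block headers.

```c
/* Sketch of replay at startup; journal_rec, JOURNAL_SLOTS and
 * checksum32() are from the append sketch above, apply_block() is a
 * hypothetical helper that rewrites one block of the main data file. */
extern int apply_block(int dbfd, uint32_t block_no,
                       const uint8_t *data, size_t len);

void journal_replay(int jfd, int dbfd, uint32_t main_txn_no)
{
    struct journal_rec rec;

    /* a real implementation would replay in transaction order; a plain
     * slot scan keeps the sketch short */
    for (int slot = 0; slot < JOURNAL_SLOTS; slot++) {
        if (pread(jfd, &rec, sizeof rec, (off_t)slot * sizeof rec)
                != (ssize_t)sizeof rec)
            continue;   /* short read: empty slot */
        if (rec.crc != checksum32(&rec, offsetof(struct journal_rec, crc)))
            continue;   /* torn or never-written record */
        if (rec.txn_no <= main_txn_no)
            continue;   /* already reflected in the data file */
        apply_block(dbfd, rec.block_no, rec.data, sizeof rec.data);
    }
    fsync(dbfd);
}
```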

S.Lott
It's really very unlikely that an embedded device will be running Oracle :-). However, SQLite would definitely be worth a look as it's free, has a tiny footprint and is generally an excellent piece of software.
James Anderson
The point is NOT to use Oracle. The point is to borrow their design pattern for reliability.
S.Lott
This is how SQLite does it: http://sqlite.org/atomiccommit.html
tingyu
A: 

It depends on the OS, etc., but in most cases what you can do is write to a temporary file name and, as the final step, rename the file to the correct name.

This means the WOOPS (Window Of Opportunity Of Potential S****p) is confined to the interval in which the rename takes place.
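A minimal sketch of that approach on a POSIX-ish system; the file name, the .tmp suffix and the buffer handling are just examples:

```c
/* Write-temp-then-rename sketch, assuming POSIX; names are examples. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int save_atomically(const char *path, const void *buf, size_t len)
{
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* make sure the new contents are on disk before they can replace
     * the old file */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() replaces the target in one step on POSIX file systems,
     * so a reader sees either the old file or the new one, never a mix */
    return rename(tmp, path);
}
```

On some file systems you may also want to fsync() the containing directory after the rename so the new directory entry itself is durable.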

If the OS supports a proper directory structure and you lay out the files intelligently, you can refine this further by copying the new files to a temp directory and renaming the directory, so the WOOPS becomes the interval between "rename target to save" and "rename temp to target".

This gets even better if the OS supports soft links to directories: then you can "ln -s target temp". On most OSes replacing a soft link is an "atomic" operation which either works or doesn't, without any messy halfway states.
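Just a sketch of that soft-link variant, assuming POSIX symlink()/rename(); the link and directory names are made up:

```c
/* "current" is a symlink to the live version directory; a new version
 * is published by building a fresh link and renaming it into place. */
#include <stdio.h>
#include <unistd.h>

int publish_version(const char *current_link, const char *new_dir)
{
    char tmp_link[256];
    snprintf(tmp_link, sizeof tmp_link, "%s.new", current_link);

    unlink(tmp_link);                      /* ok if it didn't exist */
    if (symlink(new_dir, tmp_link) != 0)
        return -1;

    /* rename() over an existing symlink is atomic on POSIX, so the
     * "current" name always points at a complete version directory */
    return rename(tmp_link, current_link);
}
```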

All these options depend on having enough storage to keep a complete old and new copy on the file system.

James Anderson
Sorry, I don't quite understand the second option: renaming the directory. What are the pros compared with the first option? The interval seems longer than in the first one.
tingyu
With the second option the strategy is to group the files which are likely to change into one or two directories. As there will be fewer directories than files, this should be quicker.
James Anderson