Hi Guys,

I am working on the development of an application that performs online backup of the files and folders on a PC, either automatically or manually. Currently, I keep only the latest version of each file on the server. Now I have to implement versioning, so that only the changes are transferred to the online server, and the user must be able to download any of the available versions of a file from the backup server.

I need to perform deduplication for this. I am able to do it using a fixed block size, but I am facing the overhead of transferring a file containing the CRC information with each version's backup.

I have never worked with this technology before, so I lack experience. I am eager to know whether there is any feasible way to embed this functionality in the application without too much pain. Would any third-party tool help to do the same thing? Please let me know.

Note: I am using the FTP protocol to transfer the data.

+1  A: 

There's a program called dump that does something similar, but it operates on filesystem blocks rather than files. rsync may also be of interest.

You will need to keep track of a large number of blocks with multiple versions and how they fit into the various versions of the original files, so you will need some kind of database to track this information, and an efficient way to query it to determine which blocks in a given file need to be transferred. Also note that adding something to the beginning of a file will cause all your blocks to be "new" if you use a naive blocking and diff scheme.
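
To make that last point concrete, here is a minimal Python sketch (the block size and names are my own, for illustration only): it hashes a file in fixed-size blocks, then shows that prepending a single byte shifts every block boundary, so a naive fixed-block scheme sees the entire file as new.

    import hashlib

    BLOCK_SIZE = 4  # unrealistically small, just for the demonstration

    def block_hashes(data, block_size=BLOCK_SIZE):
        # Split the data into fixed-size blocks and hash each one.
        return [hashlib.md5(data[i:i + block_size]).hexdigest()
                for i in range(0, len(data), block_size)]

    v1 = b"abcdefghijklmnop"
    v2 = b"X" + v1  # one byte prepended

    h1 = block_hashes(v1)
    h2 = block_hashes(v2)

    # Every boundary has shifted, so no block hash survives:
    print(sum(a == b for a, b in zip(h1, h2)), "of", len(h1), "blocks match")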

To do this well will be very complex. I highly recommend you thoroughly research already-available solutions, and if you decide you need to write your own, consider the benefits of their designs carefully.

Tim Sylvester
Yeah, I have been researching this since last week. The solution I came up with: I treat the file block-wise, and for every version I keep a structure string (to track the availability of blocks, so that they can be reached from previous versions) plus a compiled list of the CRCs of each block, so that this list can be downloaded and compared with the current version's list to find the differences. I need to confirm whether my approach is correct, and how this is implemented in the real world.
Sumeet
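
A minimal sketch of the per-version CRC-list approach described in the comment above, assuming fixed-size blocks and Python's zlib.crc32 (the names are illustrative, not from the thread). The client downloads the previous version's CRC list, compares it with the current file's list, and uploads only the blocks that differ:

    import zlib

    BLOCK_SIZE = 4096

    def crc_list(data, block_size=BLOCK_SIZE):
        # One CRC32 per fixed-size block of the file.
        return [zlib.crc32(data[i:i + block_size])
                for i in range(0, len(data), block_size)]

    def changed_blocks(old_crcs, new_data, block_size=BLOCK_SIZE):
        # Block indices that must be uploaded for the new version,
        # plus the new CRC list to store alongside it.
        new_crcs = crc_list(new_data, block_size)
        changed = [i for i, crc in enumerate(new_crcs)
                   if i >= len(old_crcs) or old_crcs[i] != crc]
        return changed, new_crcs

Note that CRC32 collides easily, so a real implementation should confirm matches with a strong hash, and, as the next comment points out, any insertion shifts the block boundaries and marks everything after it as changed.
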
I'm sure it can be made to work, but it's not ideal. As I said, certain types of changes will cause you to transfer the entire file (which could be gigabytes) for a single-byte change. Depending on the block size, this could also mean thousands or millions of duplicate blocks (each a file?) on your server, which will make directory listings uselessly slow. I would look at the "delta encoding" link on the rsync page. This allows you to send only the parts of the file that actually changed. A lot of work has gone into making this efficient. The code of dump and rsync is available to look at as well.
Tim Sylvester
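
For reference, the delta encoding rsync uses rests on a weak rolling checksum that can slide a one-block window forward one byte at a time in O(1), so matching blocks are found at any offset rather than only at fixed boundaries. A simplified Python sketch of such a checksum (modeled on rsync's, but not its actual code):

    MOD = 1 << 16

    def weak_checksum(block):
        # a: plain byte sum; b: byte sum weighted by distance from the
        # end of the window. Both are kept modulo 2**16.
        a = sum(block) % MOD
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        # Slide the window one byte to the right in O(1).
        a = (a - out_byte + in_byte) % MOD
        b = (b - block_len * out_byte + a) % MOD
        return a, b

    data = b"the quick brown fox jumps over the lazy dog"
    n = 8
    a, b = weak_checksum(data[:n])
    for k in range(len(data) - n):
        a, b = roll(a, b, data[k], data[k + n], n)
        assert (a, b) == weak_checksum(data[k + 1:k + 1 + n])

When the weak checksum of a window matches a known block, a strong hash confirms the match, and only the unmatched byte ranges are transferred.
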
I have started working on the rsync code. What do you think? Would it be helpful?
Sumeet
Rsync is very efficient and reliable. If it can be adapted to your needs, then that would be great. I also meant to mention that FTP is basically an obsolete protocol; you're better off with something else.
Tim Sylvester
At this point we can't move back; we have to find a solution while somehow keeping FTP as the base protocol. I am able to use the rdiff method (customized for our use) to compute the diff, so I only need to transfer the diff. The problem we are currently facing is that the diff is always generated against the base version of the file, not the previous version. To diff against the previous version, we face the overhead of calculating the signature for each version and sending it across along with the diff. So I am just not sure whether we are on the right track or have missed something.
Sumeet
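
On that last question: a common way to avoid recomputing signatures is to store each version's signature on the server next to its delta, so that diffing against the previous version only costs one small signature download. A sketch driving librsync's rdiff command-line tool from Python (this assumes rdiff is installed; the file names are illustrative):

    import subprocess

    def backup_new_version(prev_sig, new_file, delta_out, new_sig):
        # Delta of the new version against the PREVIOUS version's
        # signature, downloaded from the server, not the base version.
        subprocess.run(["rdiff", "delta", prev_sig, new_file, delta_out],
                       check=True)
        # Signature of the new version; uploaded alongside the delta so
        # it never needs to be recomputed for the next backup.
        subprocess.run(["rdiff", "signature", new_file, new_sig],
                       check=True)
        # Upload delta_out and new_sig over FTP here.

    def restore_next(basis, delta, result):
        # To restore version N, replay the chain of deltas:
        # patch v0 with delta1 to get v1, patch v1 with delta2, and so on.
        subprocess.run(["rdiff", "patch", basis, delta, result], check=True)

The trade-off of chaining deltas this way is that restoring version N requires applying N deltas, so real systems periodically store a full snapshot (or keep reverse deltas from the latest version) to bound the restore cost.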