I need to write a tool in C++ to determine the changed bits in a file compared against another file for replication. What would be the best method of accomplishing this?

I don't have a specific OS or library in mind, I'm open to suggestions. My primary goal is reducing the amount of network traffic involved in replicating.

+17  A: 

Look at rsync. It splits the file into blocks and calculates a checksum for each block. Only the checksums are exchanged at first, to determine which blocks have changed, and block data is transmitted only where necessary.
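To make the block idea concrete, here is a minimal sketch of per-block checksum comparison. It is not rsync's actual algorithm or wire protocol: the function names are made up for illustration, the block grid is fixed, and FNV-1a is only a stand-in for whatever strong checksum you would really use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a (64-bit) over data[begin, end). Assumption: a stand-in for a real strong checksum.
std::uint64_t fnv1a(const std::string& data, std::size_t begin, std::size_t end) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = begin; i < end; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// One checksum per fixed-size block; the last block may be shorter.
std::vector<std::uint64_t> block_checksums(const std::string& data, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::uint64_t> sums;
    for (std::size_t off = 0; off < data.size(); off += block_size) {
        std::size_t end = std::min(off + block_size, data.size());
        sums.push_back(fnv1a(data, off, end));
    }
    return sums;
}

// Indices of blocks in `local` whose checksum differs from, or has no
// counterpart in, the checksum list received from the other side.
std::vector<std::size_t> changed_blocks(const std::string& local,
                                        const std::vector<std::uint64_t>& remote_sums,
                                        std::size_t block_size) {
    std::vector<std::uint64_t> local_sums = block_checksums(local, block_size);
    std::vector<std::size_t> changed;
    for (std::size_t i = 0; i < local_sums.size(); ++i) {
        if (i >= remote_sums.size() || local_sums[i] != remote_sums[i]) {
            changed.push_back(i);
        }
    }
    return changed;
}
```

Only the checksum list and the blocks reported by `changed_blocks` would need to cross the network. Note that with a fixed grid like this, inserting a single byte near the start of the file makes every later block look changed.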

Martin Beckett
It's actually even better than that -- it uses a rolling checksum, which can detect equal blocks even if they've been shifted to non-block-aligned locations.
ephemient
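For what the rolling checksum in the comment above looks like, here is a rough sketch in the style of rsync's weak checksum (two 16-bit running sums). The struct and member names are my own, and this is an illustration under those assumptions rather than rsync's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Weak rolling checksum over a window of `len` bytes:
//   a = sum of the bytes,                   mod 2^16
//   b = sum of (len - i) * byte at index i, mod 2^16
struct RollingSum {
    std::uint32_t a = 0;
    std::uint32_t b = 0;
    std::uint32_t len = 0;

    // Compute the checksum from scratch for data[pos, pos + n).
    void init(const std::string& data, std::size_t pos, std::size_t n) {
        assert(pos + n <= data.size());
        a = 0;
        b = 0;
        len = static_cast<std::uint32_t>(n);
        for (std::size_t i = 0; i < n; ++i) {
            std::uint32_t byte = static_cast<unsigned char>(data[pos + i]);
            a += byte;
            b += static_cast<std::uint32_t>(n - i) * byte;
        }
        a &= 0xffffu;
        b &= 0xffffu;
    }

    // Slide the window one byte to the right in O(1):
    // `out` leaves on the left, `in` enters on the right.
    void roll(unsigned char out, unsigned char in) {
        a = (a - out + in) & 0xffffu;
        b = (b - len * out + a) & 0xffffu;
    }

    std::uint32_t value() const { return a | (b << 16); }
};
```

Because `roll` is constant-time, the checksum can be evaluated at every byte offset of a file cheaply, which is what makes it possible to spot a known block at a non-aligned position. A weak match would still need to be confirmed with a stronger hash.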
While I was writing up my answer, I realized that you forgot to mention a good selling point of rsync: it works without requiring both versions of the file being synced to be present on the sender machine.
Alexander
A: 

I would start by trying an implementation of diff (http://en.wikipedia.org/wiki/Diff).

Steve
Diff typically works by comparing all the data. Not what you want for reducing network traffic...
dmckee
@dmckee, it depends on the exact scenario. In some scenarios both versions of each file are available on the sender machine. In that case you can generate diffs/deltas that result in much less traffic than rsync would.
Alexander
+2  A: 

If you can't use rsync as is, check librsync. It's old, but the code is easy to read and improve.

Javier
A: 

Suggestion: use a hash function and a divide-and-conquer approach to narrow down the changed block(s). Hashing is not strictly collision-proof, but SHA-2 could, IMO, work for you.
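A minimal sketch of that divide-and-conquer idea, with some loud assumptions: both versions have the same length (changes are in place, no insertions or deletions), and FNV-1a stands in for SHA-2 because the C++ standard library has no SHA-2. In a real tool the hash of the remote range would be requested over the network instead of computed locally. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// FNV-1a (64-bit) over data[begin, end). Assumption: stand-in for SHA-2.
std::uint64_t range_hash(const std::string& data, std::size_t begin, std::size_t end) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = begin; i < end; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Recursively halve [begin, end) until the differing ranges are at most
// `min_size` bytes long; each such range is appended to `out`.
void find_changes(const std::string& a, const std::string& b,
                  std::size_t begin, std::size_t end, std::size_t min_size,
                  std::vector<std::pair<std::size_t, std::size_t>>& out) {
    assert(a.size() == b.size());
    assert(min_size > 0);
    if (begin >= end) return;
    // Equal hashes: treat the whole range as unchanged (collisions are possible).
    if (range_hash(a, begin, end) == range_hash(b, begin, end)) return;
    if (end - begin <= min_size) {
        out.emplace_back(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    find_changes(a, b, begin, mid, min_size, out);
    find_changes(a, b, mid, end, min_size, out);
}
```

For a single small change this exchanges roughly two hashes per level of halving, at the cost of one network round trip per level.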

Kapil Kapre
+2  A: 

If you don't have the old and new versions of files on the same machine, then rsync-like algorithms are the way forward (see previous answers). If you do have both the old and the new versions of files on the same machine, you can then do better than rsync: generate compressed diffs and send them over the network.

For generating efficient diffs, have a look at VCDIFF (RFC 3284) binary delta compression. One good implementation is xdelta (www.xdelta.org). It's fairly easy to implement a decoder/decompressor yourself if you want to avoid using xdelta on the receiving end because of license issues. Writing your own VCDIFF generator that produces compact diffs is much more complicated (it has to search for moved blocks, for example).

In VCDIFF the diffs can also be sourceless, meaning they decompress into the target file without any source file (the file to which a diff is applied) at hand -- in VCDIFF compressing a file is a special case of creating a compressed delta between two files. This is useful because you can use the same format regardless of whether the destination has a version of your file.
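To illustrate why the decoder side is the easy part, here is a toy delta applier built on add and copy instructions. This is deliberately not the VCDIFF wire format from RFC 3284 (no header, no instruction tables, no RUN instruction, and copies only address the source). The types and helper names are invented for the sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy delta instruction: either literal bytes to append (Add) or a
// range of the source file to copy (Copy). Hypothetical, not VCDIFF.
struct Instruction {
    enum class Kind { Add, Copy };
    Kind kind = Kind::Add;
    std::string data;        // used by Add
    std::size_t offset = 0;  // used by Copy
    std::size_t length = 0;  // used by Copy
};

Instruction make_add(std::string bytes) {
    Instruction ins;
    ins.kind = Instruction::Kind::Add;
    ins.data = std::move(bytes);
    return ins;
}

Instruction make_copy(std::size_t offset, std::size_t length) {
    Instruction ins;
    ins.kind = Instruction::Kind::Copy;
    ins.offset = offset;
    ins.length = length;
    return ins;
}

// Rebuild the target by replaying the instructions against the source.
std::string apply_delta(const std::string& source, const std::vector<Instruction>& delta) {
    std::string target;
    for (const Instruction& ins : delta) {
        if (ins.kind == Instruction::Kind::Add) {
            target += ins.data;
        } else {
            if (ins.offset > source.size() || ins.length > source.size() - ins.offset) {
                throw std::out_of_range("copy range lies outside the source");
            }
            target.append(source, ins.offset, ins.length);
        }
    }
    return target;
}
```

A delta that contains only add instructions needs no source at all, which corresponds to the sourceless case: `apply_delta("", {make_add("abc")})` rebuilds the target from the delta alone.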

Alexander