views:

127

answers:

1

I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.

Consider this example, where A is the receiver and B is the sender.

A = abcde1234512345fghij
B = abcde12345fghij

As you can see, the only change is that 12345 has been removed.

Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the values on the sender's side using the weak checksum gives the following values list.

abcde|12345|fghij

abcde -> 495
12345 -> 255
fghij -> 520

values = [495, 255, 520]

Next we check to see if any hash values differ in A. If there's a matching block we can skip to the end of that block for the next check. If there's a non-matching block then we've found a difference. I'll step through this process.

  1. Hash the first block. Does this hash exist in the values list? abcde -> 495 (yes, so skip)
  2. Hash the second block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  3. Hash the third block. Does this hash exist in the values list? 12345 -> 255 (yes, so skip)
  4. Hash the fourth block. Does this hash exist in the values list? fghij -> 520 (yes, so skip)
  5. No more data, we're done.

Since every hash was found in the values list, we conclude that A and B are the same. Which, in my humble opinion, isn't true.

It seems to me this will happen whenever there is more than one block that share the same hash. I know I have skipped the step of calculating and checking the strong hash, but that won't make a difference since the second and third blocks are exactly the same

What am I missing?

+1  A: 

The rsync algorithm sends two checksums: one for each chunk, and a "rolling" checksum for the whole file. In your example, A will see a difference in the rolling checksum once it gets to the "doubled-up" block.

Dean Harding
Sending a checksum for the whole file is a great idea. I don't understand how A will see the difference once it gets to the doubled up block. It seems to me like the difference can only be detected once the entire checksum for A is computed, at which point we don't know what the repeating block is.
Kai
@Kai: Oops, I was just trying to rephrase that comment to make it clearer, and I lost it. The summary: from what I understand, it's a rolling *checksum*, not hash; the checksum for one block depends on the checksum for the previous block.
Jefromi
Ohh!!! The weak hash is a rolling checksum over the entire file! But its value recorded at the end of each block. Now it makes sense. Thanks codeka and Jefromi, I wouldn't have understand without both of your explanations.
Kai