I have to store two files A and B which are both very large (like 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two interesting aspects to this problem:

  1. The files are too big to be analyzed by any diff library I know of, because those libraries work in memory
  2. I don't actually need a diff - a diff typically has inserts, edits and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from the old file at an arbitrary offset".

I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: Write an algorithm that can store the files A and B with as few bytes as possible given the fact that both are quite similar.

Additional info: Although big parts might be identical, they are likely to have different offsets and to be out of order. That last fact is why a conventional diff might not save much.

+6  A: 

Take a look at rsync's algorithm; it's designed to do pretty much exactly this, so it can copy deltas efficiently. And the algorithm is pretty well documented, as I recall.
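
The core of that algorithm is a cheap rolling (weak) checksum that can be slid over B one byte at a time, so fixed-size blocks of A can be matched at arbitrary offsets and then confirmed with a strong hash (MD4 in the original implementation). Here is a minimal Python sketch of just the weak checksum, following the formulas in the rsync technical report; the block size in the self-check and the surrounding block-matching loop are left out as illustrative details:

    M = 1 << 16  # modulus used for the weak checksum

    def weak_checksum(block):
        """Adler-32-style weak checksum of one block: returns the (a, b) parts."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        """Slide the window one byte to the right: drop out_byte, take in in_byte."""
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b

    # Sanity check: rolling the window gives the same result as recomputing.
    data = bytes(range(256)) * 4
    L = 64
    a, b = weak_checksum(data[:L])
    for k in range(1, 32):
        a, b = roll(a, b, data[k - 1], data[k - 1 + L], L)
        assert (a, b) == weak_checksum(data[k:k + L])

rsync keeps the weak checksums of all blocks of A in a hash table and only computes the strong hash when the weak checksum matches, which is what makes scanning B at every byte offset affordable.
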

Will Hartung
Not sure why this hasn't gotten more votes. rsync works great, it's simple, and an excellent free implementation is available (see martinus's answer). The algorithm is described here: http://samba.anu.edu.au/rsync/tech_report/tech_report.html
Jason Orendorff
I chose this answer because rsync's rolling hash function is the key to solving this problem.
usr
+6  A: 

Hi,

One question is what the record size in your files is, i.e. can the offsets change byte by byte, or do the files consist of, say, 1024-byte blocks? Assuming the data is byte-oriented, you could do the following:

  1. Create a suffix array for file A. This array is a permutation of all byte offsets into file A. If A has 2^37 bytes, the indices are easiest represented as 64-bit integers, so every byte of A (every offset into the file) corresponds to 8 bytes in the index array, making the index array 2^40 bytes, i.e. about a terabyte; for the 100 GB files in the question it would be roughly 800 GB. You can also index only every 1024th location, say, to reduce the size of the index array. This then deteriorates the quality of packing, depending on how long the average runs of copyable fragments are.

  2. Now, to greedily pack file B, start from its beginning at offset o = 0 and use the index array to find the longest match in A for the data starting at o. Output the pair (offset into A, length of the match) to the packed file; a sketch of this greedy packing appears below. In your case this pair takes 16 bytes without any encoding, so if the run is shorter than 16 bytes you actually lose space. That is easily remedied with bit-level encoding, using a bit marker to indicate whether you encode an isolated byte (marker + 8 bits = 9 bits) or an offset/length pair (marker + 40 bits + 40 bits = 81 bits), say. After packing the longest fragment at o, advance o to the first byte after the fragment and repeat until the end of the file.

Constructing and using a suffix array is easy, and you should find references without trouble. In high-speed applications people use suffix trees or suffix tries instead, which are more complex to manipulate but provide faster lookup. In your case the array will live on secondary storage, and if the running speed of the packing phase is not an issue, a suffix array should be enough.
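
A toy, in-memory version of this greedy packing in Python, purely for illustration (a real implementation would construct the suffix array in external memory and possibly index only every k-th offset as noted above; the naive construction, the 1 KB lookahead cap and all names below are assumptions of the sketch):

    def build_suffix_array(a):
        # Naive O(n^2 log n) construction; only for illustration.
        return sorted(range(len(a)), key=lambda i: a[i:])

    def common_prefix_len(a, start, pattern):
        n = 0
        while start + n < len(a) and n < len(pattern) and a[start + n] == pattern[n]:
            n += 1
        return n

    def longest_match(a, sa, pattern):
        """Longest prefix of `pattern` occurring in `a`: binary-search the suffix
        array, then the best match is one of the two neighbouring suffixes."""
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if a[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        best_len, best_off = 0, 0
        for idx in (lo - 1, lo):
            if 0 <= idx < len(sa):
                n = common_prefix_len(a, sa[idx], pattern)
                if n > best_len:
                    best_len, best_off = n, sa[idx]
        return best_off, best_len

    def pack(a, b, min_copy=16):
        """Greedily encode B as ('copy', offset_in_A, length) and ('literal', bytes) ops."""
        sa = build_suffix_array(a)
        ops, o = [], 0
        while o < len(b):
            off, length = longest_match(a, sa, b[o:o + 1024])  # cap lookahead in the sketch
            if length >= min_copy:
                ops.append(('copy', off, length))
                o += length
            else:
                # Run too short to pay for a copy record: emit one literal byte.
                if ops and ops[-1][0] == 'literal':
                    ops[-1] = ('literal', ops[-1][1] + b[o:o + 1])
                else:
                    ops.append(('literal', b[o:o + 1]))
                o += 1
        return ops
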

antti.huima
+6  A: 

That is exactly the problem known as "data deduplication". The most commonly used approach is:

  • Read over the files in blocks:
    • Split the data into so-called "chunks". The most often used approach is "content-defined chunking using Rabin's fingerprinting method" (Code). That chunking approach leads to better deduplication on most data sets than statically sized chunks (e.g. shown here).
    • Fingerprint the chunks using a cryptographic fingerprinting method, e.g. SHA-256.
    • Store the fingerprints in an index and look up each chunk's fingerprint to see whether it is already known. If the fingerprint is known, there is no need to store the chunk a second time; only when the fingerprint is not known does the data have to be stored.

Such a data deduplication algorithm is not as exact as e.g. xdelta, but it is faster and more scalable for large data sets. The chunking and fingerprinting run at around 50 MB/s per core (Java). The index size depends on the redundancy, the chunk size, and the data size. For 200 GB of data, it should fit in memory for chunk sizes of e.g. 16 KB.
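
A compact Python sketch of that pipeline, with a gear-style rolling hash standing in for a real Rabin fingerprint (the random table, the mask and the chunk-size bounds are illustrative values of the sketch, not those of any particular tool):

    import hashlib, random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # fixed per-byte random table
    MASK = (1 << 13) - 1                                 # ~8 KB average chunk size
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024           # illustrative bounds

    def chunk(data):
        """Content-defined chunking with a gear-style rolling hash (a simple
        stand-in for Rabin fingerprinting): cut points depend only on nearby
        bytes, so an insertion in B shifts only the boundaries around it."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & (2**64 - 1)
            length = i + 1 - start
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def dedup(files):
        """Store each unique chunk once, keyed by its SHA-256 fingerprint, plus a
        per-file 'recipe' of fingerprints from which the file can be rebuilt."""
        store, recipes = {}, {}
        for name, data in files.items():
            recipe = []
            for c in chunk(data):
                fp = hashlib.sha256(c).hexdigest()
                if fp not in store:       # only unknown fingerprints cost space
                    store[fp] = c
                recipe.append(fp)
            recipes[name] = recipe
        return store, recipes

Restoring a file is then just concatenating the stored chunks named by its fingerprint list, so A and B together cost one copy of every chunk they share plus their recipes.
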

Bentley and McIlroy's compression approach is very similar (it is used e.g. by Google's BigTable), but I am not aware of any out-of-the-box command line tool that uses this compression technique.

The "fs-c" open source project contains most of the code that is necessary. However, fs-c itself tries only to measure the redundancies and the analzye files in-memory or using a Hadoop cluster.

dmeister
Here is a tool that combines Bentley-McIlroy compression with zlib compression: http://www.di.unipi.it/~ferragin/software.html
dmeister
A: 

Depending on your performance requirements, you could get away with sampling the chunks you fingerprint and growing them when they match. That way you don't have to run a checksum over the entire large file.

If you need arbitrary byte alignments and you really care about performance, look at the simhash algorithm, and use it to find similar but unaligned blocks.
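
For reference, a rough simhash sketch over overlapping byte shingles: similar blocks end up with fingerprints that differ in only a few bits, which makes near-duplicate detection cheap. The shingle size and the use of MD5 as the per-feature hash are arbitrary choices of the sketch:

    import hashlib

    def simhash(data, shingle=8, bits=64):
        """64-bit simhash over overlapping byte shingles: similar inputs yield
        fingerprints with a small Hamming distance."""
        counts = [0] * bits
        for i in range(len(data) - shingle + 1):
            h = int.from_bytes(hashlib.md5(data[i:i + shingle]).digest()[:8], 'big')
            for b in range(bits):
                counts[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if counts[b] > 0)

    def hamming(x, y):
        return bin(x ^ y).count('1')  # small distance => likely similar blocks
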

Tobu
+2  A: 

You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:

  1. Create a signature of one file, e.g. with

    rdiff signature A sig.txt
    
  2. Using the generated signature file sig.txt and the other big file, create the delta:

    rdiff delta sig.txt B delta
    
  3. Now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run

    rdiff patch A delta B
    

In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast; I get about 40 MB per second on my PC. I have just tried it on an 8 GB file, and the memory used by rdiff was about 1 MB.

martinus