Imagine that you are handed a new book every day by an author. The book is a work in progress. He does not tell you what he has changed or added.

Your job is to identify the changes and additions, and pass ONLY these along to the publisher (who does not have time to read the entire book every day).

For the purposes of this problem, the book consists of 1m lines of ASCII text and growing (actually a MySQL backup file).

My current idea is to make a secure hash (SHA256, for example) of each line (1k chars) and store it on disk. Since each hash is only 32 bytes, the file is only 32MB.

Then when we get the next file tomorrow, we go through it line by line, creating a new hash for each line and comparing it to the hash from the previous day.

When the process is finished we overwrite the hash file ready for the next day.

The comparison uses a binary search method of string compare (> and < operators). This returns a result in an average of four iterations.

I have not coded a B-tree index solution yet, but how would you tackle this?
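Roughly what I have in mind, sketched in Python (file names are placeholders):

    import hashlib

    def hash_lines(path):
        # One 32-byte SHA-256 digest per line of the backup file.
        with open(path, "rb") as f:
            return [hashlib.sha256(line.rstrip(b"\n")).digest() for line in f]

    # Build today's hash file: 1m lines x 32 bytes = roughly 32MB on disk.
    with open("hashes_today.bin", "wb") as out:
        out.writelines(hash_lines("book_today.txt"))

Tomorrow the same pass runs again, each new hash is checked against this file, and the hash file is overwritten at the end.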

+1  A: 

I would use diff.

If I needed to implement it within my own program, I would use one of the algorithms for finding the longest common subsequence of two sequences, treating each file as a sequence of lines.
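For illustration, a minimal sketch in Python using difflib (its SequenceMatcher isn't a textbook LCS, but it does the same job of finding matching runs of lines; file names are placeholders):

    import difflib

    with open("book_yesterday.txt") as f:
        old_lines = f.readlines()
    with open("book_today.txt") as f:
        new_lines = f.readlines()

    # n=0 keeps only the changed lines themselves, with no surrounding context,
    # which is all the publisher needs to see.
    for line in difflib.unified_diff(old_lines, new_lines,
                                     fromfile="yesterday", tofile="today", n=0):
        print(line, end="")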

Glomek
A: 

"Then when we get the next file tomorrow, we go through it line by line, creating a new hash for each line and comparing it to the hash from the previous day."

Got it: 1m lines of today's hash values compared with 1m lines of yesterday's values.

Do lines get inserted or removed? If not, this is a simple set of parallel reads to see if the hashes are different.
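For that simple case, a rough sketch (file names are placeholders; assumes fixed-width 32-byte digests, one per line, written in line order):

    # Compare yesterday's and today's hash files record by record.
    RECORD = 32

    def changed_line_numbers(old_path, new_path):
        with open(old_path, "rb") as old, open(new_path, "rb") as new:
            lineno = 0
            while True:
                a, b = old.read(RECORD), new.read(RECORD)
                if not a or not b:
                    break              # one file ended; no inserts/removals assumed
                if a != b:
                    yield lineno
                lineno += 1

    for n in changed_line_numbers("hashes_yesterday.bin", "hashes_today.bin"):
        print("changed:", n)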

If there are adds or removals, you'll have to use the diff algorithm to determine the scope of the change.

All that's fine. Not too difficult to implement.

In that context, the following makes no sense.

"The comparison uses a binary search method of string compare (> and < operators). This returns a result in an average of four iterations."

Is there some kind of ordering to the hash values? Or some tree structure?

S.Lott
A: 

A book of 1 million lines is huge: there are perhaps 30 - 50 lines per page, so let's be generous and assume 100 lines per page, which means 10,000 pages in the book.

Lines of 1 KB are also much larger than is normal; basic readability suggests nowhere near that many characters per line. Do you intend to hash lines of up to 1 KB, or chunk the file in 1 KB chunks? One problem with your scheme is that any repeated lines would have a repeated hash; you could never identify when one of those lines was added or deleted.

You would, presumably, need to notify the publisher of deleted lines too.

As with Glomek, I would use diff on the file. If you keep the file under RCS or CVS control, you would have just the current version of the file and the diffs between prior versions stored. With this, you would be able to provide cumulative diffs over a week or month too.

And I probably wouldn't develop my own B-Tree indexing.

Jonathan Leffler
A: 

the solution you describe is somewhat similar to the rsync algorithm. one important point is that rsync has to recognize existing chunks anywhere in the target file, at any offset from the original.

if your files are really record-structured, you can simplify a bit like you propose. if not, you need a rolling checksum.

also, do you have to recognize reorderings? or only insertions/deletions/replacements?

the most generic case is the full rsync algorithm, which goes like this:

  • parameters definition:

    1. choose a block size. 512 or 1k usually works ok.
    2. choose a 'strong' checksum. something like MD4 or so. 64 bits are plenty.
    3. choose a 'weak' rolling checksum: one that lets you 'subtract' the tail byte and 'add' a head byte to get the checksum of the block one byte forward. usually a 16-bit checksum works ok (a sketch of this follows after the list).
  • signature of old file:

    1. traverse the whole old file, calculating both weak and strong checksums for each block. with 16-bit and 64-bit checksums and 512-byte blocks, that means 10 bytes per block, or 20KB per megabyte. this is the 'signature'
  • create 'patch' with new file, and signature of old file:

    1. load the signature of the old file, ideally into a hash table, with the weak checksums as keys and the strong checksums and block positions as the values.
    2. read the first block of the new file
    3. calculate the weak checksum of loaded block
    4. check the hash table to see if the weak checksum is there.
    5. if found, calculate the strong checksum and compare with the one found in the hash
    6. if both checksums match, mark as 'got it' with the block reference in the hash, advance one whole blocksize and go back to step 3
    7. if the strong checksum didn't match, or if the weak checksum wasn't in the hash, 'roll' the weak checksum: that is, 'add' the next byte after the block and 'subtract' the first byte from the tail.
    8. add the byte 'subtracted' from the tail to the list of 'new' bytes in the patch
    9. go back to step 4
  • apply patch to old file

    1. the 'patch' is the list of 'new' bytes that dropped off while rolling the checksum, plus the list of 'got it' blocks that match on the old file.
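a rough python sketch of the 'weak' rolling checksum part (adler-32 style, which packs two 16-bit sums into 32 bits the way rsync does; the block size here is arbitrary):

    MOD = 1 << 16    # keep both sums as 16-bit values

    class RollingChecksum:
        def __init__(self, block: bytes):
            self.blocksize = len(block)
            self.a = sum(block) % MOD
            self.b = sum((self.blocksize - i) * c for i, c in enumerate(block)) % MOD

        def digest(self) -> int:
            return (self.b << 16) | self.a

        def roll(self, old_byte: int, new_byte: int):
            # advance the window one byte: 'subtract' the byte dropping off the
            # tail, 'add' the byte entering at the head.
            self.a = (self.a - old_byte + new_byte) % MOD
            self.b = (self.b - self.blocksize * old_byte + self.a) % MOD

    # checksum of data[1:17] derived from the checksum of data[0:16]
    data = b"the quick brown fox jumps over the lazy dog"
    rc = RollingChecksum(data[:16])
    rc.roll(data[0], data[16])
    assert rc.digest() == RollingChecksum(data[1:17]).digest()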
Javier
A: 

This is a technique used for incremental loading on a data warehouse. In the situation where you do not have the ability to identify changed data within a source system, you can take a snapshot of the data and compare it with your last snapshot to identify the differences. This technique even gets a mention in Ralph Kimball's book on the subject and is used in an application I was involved in the design of.

You need a hashing algorithm with a very wide key as this approach is vulnerable to birthday attacks. MD5 or any of the SHA family would be good. It also cannot detect deletions without a post-process that goes through the difference looking for missing natural keys. This computation actually needs to be aware of the table structure.
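A rough sketch of that deletion post-process (the key extraction is hypothetical; in practice it has to know the table layout of the dump):

    def extract_key(line: str) -> str:
        # Hypothetical: assume the natural key is the first comma-separated column.
        return line.split(",", 1)[0]

    def deleted_keys(old_snapshot: str, new_snapshot: str) -> set:
        with open(new_snapshot) as f:
            new_keys = {extract_key(line) for line in f}
        with open(old_snapshot) as f:
            old_keys = {extract_key(line) for line in f}
        # Keys present in the old snapshot but missing from the new one are deletions.
        return old_keys - new_keys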

ConcernedOfTunbridgeWells
A: 

One problem with your scheme is that any repeated lines would have a repeated hash; you could never identify when one of those lines was added or deleted

Very good point, but not an issue here: a repeated line is a duplicate, and all duplicates are deleted in the next stage of processing.

"diff" link takes me to a page with a description of what I assume is an application? There is no download link, there is no code in any language... What am I missing here?

Some of you have talked about byte-level granularity. This is not needed; only line-level granularity is required, because if anything on a line has changed, the entire line (record) must be reprocessed.

So we are comparing lines of approx 1000 chars (no binary) in two files (today's snapshot and yesterday's snapshot) that are each approx 1m lines.

So using a secure hash like SHA256 (MD5 has collisions and is slow by comparison), I can process about 30MB/sec on my HO laptop. The server, of course, will chew through it a lot faster.

So if the file is around 1GB, then making all the hashes takes about 33 sec, and reading a 1GB file using Windows paged memory takes about 30 sec. Not horrific.

Now we have two arrays of hashes representing the lines in each file. If we sort them, we can use a binary search: we iterate our way through the new file's hashes looking for a match in the old file's hashes. If we don't find one, that line is added to the changes file.
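In code the lookup I have in mind looks roughly like this (Python, file names are placeholders; a plain hash set would do the same job without the sort, but this shows the binary search version):

    import bisect
    import hashlib

    def line_hashes(path):
        with open(path, "rb") as f:
            return [hashlib.sha256(line.rstrip(b"\n")).digest() for line in f]

    # Sort yesterday's hashes once, then binary-search each of today's hashes
    # against them; misses are new or changed lines.
    old_sorted = sorted(line_hashes("snapshot_yesterday.txt"))

    def present(sorted_hashes, h):
        i = bisect.bisect_left(sorted_hashes, h)   # the < > comparisons happen in here
        return i < len(sorted_hashes) and sorted_hashes[i] == h

    with open("snapshot_today.txt", "rb") as today, open("changes.txt", "wb") as out:
        for line in today:
            h = hashlib.sha256(line.rstrip(b"\n")).digest()
            if not present(old_sorted, h):
                out.write(line)     # new or changed line goes in the changes file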

Keep in mind that the book of lines (legacy database) is unknown in every aspect. There is no guarantee of order of lines, location of changes, type of changes.

The suggestion of reading forward page by page is good, but it assumes that the two files are in the same order up until the first change. This cannot be assumed; the lines (rows) could be in any order. Also, choosing an arbitrary block size violates the granularity of a line. For the purposes of this task, lines are immutable.

From that excellent link on incremental loading:

"File Comparison Capture: This method is also known as the snapshot differential method. This method works by keeping before and after images of files that are of concern to the data warehouse. Records are compared to find changes, and record keys are compared to find inserts and deletes. This technique is most appropriate in the case of legacy systems due to the fact that triggers typically do not exist and transaction logs are either non-existent or in a proprietary format. Since most legacy databases have some mechanism for dumping data into files, this technique creates periodic snapshots and then compares the results to produce change records. Certainly, all the problems of static capture are present here. Added complexity is introduced by the challenge of comparing entire rows of information and by key identification and matching. This technique is complex in nature and typically not desirable but, in some cases, may be the only solution."

This is most relevant here: As we proceed into the realm of terabyte data warehouses, the ability to rebuild the data warehouse from scratch on a nightly basis will go the way of the dinosaur. The logical and efficient approach to updating the data warehouse involves some form of incremental update strategy.

So I guess I am on the right track then? A B-tree index would not afford an advantage?