One problem with your scheme is that any repeated lines would have a repeated hash; you could never identify when one of those lines was added or deleted.
Very good point, but not an issue here: a repeated line is a duplicate, and all duplicates are removed in the next stage of processing. So yes, you are right, but it does not affect this scheme.
"diff" link takes me to a page with a description of what I assume is an application?
There is no download link, there is no code in any language... What am I missing here?
Some of you have talked about byte-level granularity. This is not needed: only line-level granularity is required, because if anything on a line has changed, the entire line (record) must be reprocessed; any change within the line affects the whole line.
So we are comparing lines of approx. 1000 chars (no binary) in two files (today's snapshot and yesterday's snapshot) of approx. 1M lines each.
So using a secure hash like SHA-256 (MD5 has collisions and is slow by comparison), I can process about 30 MB/sec on my HO laptop. The server, of course, will chew through it a lot faster.
So if the file is around 1 GB, then computing all the hashes takes about 33 sec, and reading a 1 GB file through Windows paged memory takes about 30 sec. Not horrific.
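For concreteness, here is a minimal Python sketch of that hashing pass over one snapshot. The function name and the newline-delimited record layout are my assumptions; the real data is whatever the legacy dump produces.

```python
import hashlib

def hash_lines(path: str) -> list:
    """One SHA-256 digest per line (record) of a snapshot, in file order."""
    digests = []
    with open(path, "rb") as f:              # plain text, roughly 1000 chars per line
        for line in f:
            digests.append(hashlib.sha256(line.rstrip(b"\r\n")).digest())
    return digests
```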
Now we have two arrays of hashes representing the lines in each file.
If we sort them, we can use a binary search: we iterate through the new file's hashes looking for a match in the old file's hashes, and if we don't find one, that line is written to the changes file.
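A minimal sketch of that sort-and-search step, again in Python; the helper names (line_hash, find_added_lines) are illustrative, not part of any existing tool:

```python
import hashlib
from bisect import bisect_left

def line_hash(line: bytes) -> bytes:
    """SHA-256 digest of one record, ignoring the trailing newline."""
    return hashlib.sha256(line.rstrip(b"\r\n")).digest()

def find_added_lines(new_path: str, old_path: str) -> list:
    """Lines of today's snapshot whose hash does not appear in yesterday's snapshot."""
    # Hash every line of the old snapshot and sort once, so each lookup is a binary search.
    with open(old_path, "rb") as f:
        old_sorted = sorted(line_hash(line) for line in f)

    added = []
    with open(new_path, "rb") as f:
        for line in f:
            h = line_hash(line)
            i = bisect_left(old_sorted, h)               # binary search in the sorted old hashes
            if i == len(old_sorted) or old_sorted[i] != h:
                added.append(line)                       # no match: line goes to the changes file
    return added
```

Running the same comparison the other way round (old against new) picks up deleted lines, and at 32 bytes per digest each array is only about 32 MB for a million lines, so both fit comfortably in memory.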
Keep in mind that the book of lines (the legacy database) is unknown in every aspect: there is no guarantee about the order of the lines, the location of changes, or the type of changes.
The suggestion of reading forward page by page is good, but it assumes that the two files are in the same order up until the first change. That cannot be assumed; the lines (rows) could be in any order. Also, choosing an arbitrary block size violates the granularity of a line. For the purposes of this task, lines are immutable.
From that excellent link on incremental loading:
File Comparison Capture: This method is also known as the snapshot differential method. This method works by keeping before and after images of files that are of concern to the data warehouse. Records are compared to find changes, and record keys are compared to find inserts and deletes. This technique is most appropriate in the case of legacy systems due to the fact that triggers typically do not exist and transaction logs are either non-existent or in a proprietary format. Since most legacy databases have some mechanism for dumping data into files, this technique creates periodic snapshots and then compares the results to produce change records. Certainly, all the problems of static capture are present here. Added complexity is introduced by the challenge of comparing entire rows of information and by key identification and matching. This technique is complex in nature and typically not desirable but, in some cases, may be the only solution.
This is most relevant here:
As we proceed into the realm of terabyte data warehouses, the ability to rebuild the data warehouse from scratch on a nightly basis will go the way of the dinosaur. The logical and efficient approach to updating the data warehouse involves some form of incremental update strategy.
So I guess I am on the right track then? Would a B-tree index not afford any advantage?