I have two versions of a very large and complicated directory structure with tens of thousands of individual files and I want to look for significant file changes from one version to another.

Each and every file has changed in some minor way. For example you might have a file called intro.txt which would contain

[Build 1057 done by Mike 12:00] - (version 1)

[Build 1065 done by Mike 18:10] - (version 2)

I don't care about changes like that since they contain no useful information. I also don't care about corrections to spelling mistakes or the addition of a word or two.

What I really want to do is pull out which files have changed in a more major way. One way they might have changed is for a lot of extra content to have been added which would increase the filesize - that's the kind of change I am interested in.

So, how would you recursively walk through the directories looking for files that have increased (or decreased) in size by a set amount from one version to the next?

I'm running Linux, but pretty much any language will do.

+3  A: 

In Python, you want to start with the filecmp module.

Compare the directories, then print out the files which are missing from one side or the other (left_only and right_only).

Then for the diff_files you need to do a more detailed comparison: use os.stat to find out the sizes, and print out the filename if the difference is too large.

Finally you need to recurse into common subdirectories.
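The steps above could be sketched roughly as follows. This is only an illustration, not a tested tool: the function name report_big_changes and the 1024-byte default threshold are made up for the example, and filecmp.dircmp skips a few names (such as CVS and RCS) by default.

```python
import filecmp
import os


def report_big_changes(old_dir, new_dir, threshold=1024):
    """Return a sorted list of (relative_path, note) for significant changes.

    report_big_changes and threshold are hypothetical names for this sketch.
    """
    results = []

    def walk(cmp, rel):
        # Files or directories present on only one side.
        for name in cmp.left_only:
            results.append((os.path.join(rel, name), "only in old"))
        for name in cmp.right_only:
            results.append((os.path.join(rel, name), "only in new"))
        # Files present on both sides that filecmp considers different.
        for name in cmp.diff_files:
            old_size = os.stat(os.path.join(cmp.left, name)).st_size
            new_size = os.stat(os.path.join(cmp.right, name)).st_size
            delta = new_size - old_size
            if abs(delta) >= threshold:
                results.append((os.path.join(rel, name), f"{delta:+d} bytes"))
        # Recurse into subdirectories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(rel, name))

    walk(filecmp.dircmp(old_dir, new_dir), "")
    return sorted(results)
```

Calling report_big_changes("foo.old", "foo.new", threshold=2048) would then list only the files whose size moved by at least 2048 bytes in either direction, plus anything that exists on one side only.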

Douglas Leeder
Hopefully that's enough to get you started.
Douglas Leeder
Thanks Doug - that looks like plenty to get started. Even though I gave free rein on the language, I had a feeling that the first answer would reference Python :)
MikeCroucher
+2  A: 

I'd do a diff -r -q FOLDER1 FOLDER2 to get a list of files that have changed, then process that list (a bash script is sufficient): check the size difference for each file, and print the filename if the difference exceeds a threshold.

The -q option to diff is for brief output: it just prints one line for each file that differs, not the per-line changes.

The -r option is for recursive comparison of two directories.
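For anyone who would rather do the post-processing in Python than in bash, a rough sketch might look like this. It assumes GNU diff, whose brief mode is -q and which prints lines of the form "Files X and Y differ" in the C locale; the names parse_brief and big_changes are invented for the example, and the simple regex will mis-split paths that themselves contain " and ".

```python
import os
import re
import subprocess

# Assumed GNU diff -q line format: "Files old/a.txt and new/a.txt differ"
DIFFER_LINE = re.compile(r"^Files (.+) and (.+) differ$")


def parse_brief(diff_output):
    """Extract (old_path, new_path) pairs from `diff -r -q` output."""
    pairs = []
    for line in diff_output.splitlines():
        match = DIFFER_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def big_changes(old_dir, new_dir, threshold=1024):
    """Run diff -r -q and keep pairs whose sizes differ by >= threshold bytes."""
    # diff exits with status 1 when differences exist, so check=True is not used.
    # LC_ALL=C keeps the messages in the untranslated form the regex expects.
    proc = subprocess.run(
        ["diff", "-r", "-q", old_dir, new_dir],
        capture_output=True,
        text=True,
        env={**os.environ, "LC_ALL": "C"},
    )
    return [
        (old, new)
        for old, new in parse_brief(proc.stdout)
        if abs(os.path.getsize(new) - os.path.getsize(old)) >= threshold
    ]
```

Lines such as "Only in FOLDER2: name" are ignored by this sketch; they could be collected separately if added or removed files matter too.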

unwind
+2  A: 

In bash:

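before_dir=foo.old
after_dir=foo.new
interesting_size=10   # minimum number of diff output lines worth reporting
# Read NUL-delimited names so paths containing spaces or newlines survive.
find "$before_dir" -type f -print0 | while IFS= read -r -d '' file; do
    # Map the old path onto the new tree by swapping the directory prefix.
    other="$after_dir${file#"$before_dir"}"
    diff_size=$(diff -u "$file" "$other" | wc -l)
    if [ "$diff_size" -ge "$interesting_size" ]; then
        echo "$file"
    fi
done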
Daniel Watkins
+4  A: 

There are a few modules on CPAN that provide this. For example:

File::DirCompare looks the most promising.

 use File::DirCompare;

 File::DirCompare->compare('dirA', 'dirB', sub {
     # Named $file_a/$file_b to avoid shadowing sort's special $a and $b.
     my ($file_a, $file_b) = @_;

     # The callback runs on different or missing files,
     # so perform extra checks on $file_a and $file_b here.

 });

So one example, showing files whose sizes differ by more than a prescribed number of bytes, would be:

File::DirCompare->compare('dirA', 'dirB', size_diff_by_more_than(1024) );

sub size_diff_by_more_than {
    my $this = shift;

    return sub {
        my @files = grep { $_ } @_;

        if ( @files == 2 ) {
            # get the two file sizes and report if more than $this
            my @sizes = sort { $a <=> $b } map { (stat)[7] } @files;
            print "Different by more than $this bytes: $files[1]\n"
                if $sizes[1] - $sizes[0] > $this
        }
        else {
            print "Only: $files[0]\n";
        }
    };
}

/I3az/

draegtun
+2  A: 

You can generate a diff of the two directories and run the diffstat utility on it. Diffstat reports statistics on changed files: how many lines were added, removed or modified. I guess this will give you more information than just comparing file sizes.
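If diffstat is not available, the same kind of per-file statistic can be approximated with Python's difflib. The sketch below is not diffstat itself and will not necessarily match its numbers; line_stats is a hypothetical helper that counts added and removed lines between two lists of lines.

```python
import difflib


def line_stats(old_lines, new_lines):
    """Return (added, removed) line counts between two lists of lines.

    line_stats is a hypothetical helper; a modified line counts as one
    removal plus one addition.
    """
    added = removed = 0
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return added, removed
```

Reading each pair of files with something like open(path, errors="replace").read().splitlines() and reporting only the pairs where added + removed exceeds a chosen limit gives a line-based filter instead of a byte-based one.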

Eugene Morozov
A: 

On the question of how to measure the amount of difference between two files:

It might be good to run a diff of the two files and compare the length of the diff output with the overall size of the file.

This (in addition to a file size comparison) would catch cases where there were a lot of changes in the file but the overall file size did not change significantly. This may or may not be appropriate for your use case.
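One way to express "amount of change relative to file size" as a single number is sketched below in Python. It swaps the external diff for difflib.SequenceMatcher, which is an assumption of this example rather than anything the answer prescribes, and change_fraction is a made-up name.

```python
import difflib


def change_fraction(old_text, new_text):
    """Return 0.0 for identical line content, up to 1.0 when no lines are shared.

    change_fraction is a hypothetical helper built on SequenceMatcher.ratio().
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if not old_lines and not new_lines:
        return 0.0  # two empty files are treated as unchanged
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    # ratio() is 2 * matching_lines / total_lines across both inputs.
    return 1.0 - matcher.ratio()
```

A file that was completely rewritten but kept the same size scores 1.0 here, while a file where only the build-stamp line changed scores close to 0.0, so a cutoff such as 0.2 (an arbitrary figure) would separate the two cases.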

+1  A: 
dicroce