views: 860
answers: 5

I am writing a Perl script (in Windows) that uses File::Find to index a network file system. It works great, but it takes a very long time to crawl the file system. I was thinking it would be nice to somehow get a checksum of a directory before traversing it, and if the checksum matches the one taken on a previous run, skip traversing that directory. This would eliminate a lot of processing, since the files on this file system do not change often.

On my AIX box, I use this command:

csum -h MD5 /directory

which returns something like this:

5cfe4faf4ad739219b6140054005d506  /directory

The command takes very little time:

time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506  /directory

real    0m0.00s
user    0m0.00s
sys     0m0.00s

I have searched CPAN for a module that will do this, but it looks like all the modules will give me the MD5sum for every file in a directory, not for the directory itself.

Is there a way to get the MD5sum for a directory in Perl, or even in Windows for that matter, as I could call a Win32 command from Perl?

Thanks in advance!

+2  A: 

In order to get a checksum, you must read the files. This means you will need to walk the filesystem, which puts you back in the same boat you are trying to get out of.

Chas. Owens
So is it a feature of AIX that lets the "csum" command not walk the filesystem? Maybe it is using the modified timestamp on the dir? Because the example I posted above took "0" seconds on a 1.5 Terabyte filesystem.
BrianH
The csum command operates on files, and a directory is a file, so it is checksumming the directory as a file (i.e. not recursively). Add a file in a subdirectory of the one you are running csum against; you should still see the same checksum. You can also try appending some data to an already existing file; that shouldn't change the checksum either (directories contain just names; metadata is stored in inodes).
Chas. Owens
Okay - I touched a file in a lower sub-directory, and you are correct, the checksum did not change. But even that would help me - if I am at the lowest sub-directory, I would like to checksum that dir, because then I don't have to read all the files in that dir. Any way to do that with Perl?
BrianH
The checksum will also only tell you if the names of the files have changed. Their contents could be completely different. So, the Perl equivalent would be to use md5_hex from Digest::MD5 and the sorted values from file globs that list every file in the directory: md5_hex join '', sort <$dir/*>, <$dir/.*>
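
Expanded into a small helper, that one-liner looks something like this (dir_digest is an illustrative name; joining with a NUL separator rather than the empty string avoids two different listings collapsing into the same string):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Digest of a directory's sorted name listing. Renaming, adding, or
# removing entries changes the digest; editing a file's contents does
# not -- the same behavior csum shows when pointed at a directory.
sub dir_digest {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $digest = md5_hex(join "\0", sort readdir $dh);
    closedir $dh;
    return $digest;
}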
Chas. Owens
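
Tying that digest back to the question's File::Find crawl, a hypothetical sketch (the %cache hash and the share path are placeholders; note the caveat above: an unchanged digest only vouches for the names directly inside that one directory, so it is only safe to skip re-listing that directory, not to prune the whole subtree):

use strict;
use warnings;
use File::Find;

my %cache;    # path => digest saved from the previous crawl (placeholder)

find(sub {
    return unless -d $_;               # only act when visiting a directory
    my $dir    = $File::Find::name;
    my $digest = dir_digest($dir);     # helper sketched above
    return if defined $cache{$dir} && $cache{$dir} eq $digest;
    $cache{$dir} = $digest;
    # ... re-index the entries of $dir here ...
}, 'X:/some/network/share');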
+3  A: 

Can you just read the last-modified dates of the files and folders? Surely that's going to be faster than building MD5s?
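
As a sketch of that idea (%last_mtime is hypothetical state loaded from the previous run), element 9 of stat's return list is the modification time:

use strict;
use warnings;

my %last_mtime;    # path => mtime recorded on the previous run (placeholder)

# Returns true when a directory's mtime differs from the cached one.
# A directory's mtime moves only when entries directly inside it are
# created, deleted, or renamed -- not when a file deeper in the tree
# is edited.
sub dir_changed {
    my ($dir) = @_;
    my $mtime = (stat $dir)[9];        # field 9 of stat is mtime
    return 1 unless defined $mtime;    # unreadable: assume changed
    my $changed = !defined $last_mtime{$dir} || $last_mtime{$dir} != $mtime;
    $last_mtime{$dir} = $mtime;
    return $changed;
}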

SpliFF
+1, the AIX tool is probably just hashing the metadata.
Dave
The manpage for csum is misleading: "The csum command calculates a message digest for the specified files using the specified hash algorithm. This provides a reliable way to verify file integrity." - this does not lead me to believe it is hashing metadata.
BrianH
Although it could be - not disputing your claim, I'm just saying the manpage doesn't make it sound like that.
BrianH
I don't have an AIX box, but there's no way its FS is keeping around an md5 digest of its contents. What happens if you try to cat the directory?
Dave
cat'ing the directory produces a (somewhat garbled) list of the files immediately inside (not in any subdirs though). Probably a good chance that something like this is being used for the MD5sum.
BrianH
+1  A: 

From what I know, you cannot get an md5 of a directory; md5sum on other systems complains when you give it one. csum is most likely hashing the contents of the directory file itself (the list of names in the top-level directory), not traversing the tree.

You can grab the modified times for the files and hash them how you like by doing something like this:

sub dirModified {
    my ($dir) = @_;
    opendir(my $dh, $dir) or do { warn "Cannot open $dir: $!"; return };
    my @dircontents = readdir($dh);
    closedir($dh);

    foreach my $item (@dircontents){
        next if $item eq '.' || $item eq '..';
        my $path = "$dir/$item";       # file tests need the full path, not the bare name
        if( -f $path ){
            my $age = -M $path;        # age in days since last modification
            print "$age : $path - do stuff here\n";
        } elsif( -d $path ){
            dirModified($path);        # recurse into subdirectories
        }
    }
}

Yes, it will take some time to run.

moshen
+1  A: 

In addition to the other good answers, let me add this: if you want a checksum, then please use a checksum algorithm instead of a (broken!) hash function.

I don't think you need a cryptographically secure hash function in your file indexer -- instead, you need a way to see whether the directory listings have changed without storing the entire listing. Checksum algorithms do exactly that: they return a different output when the input changes. They may also be faster, since they are simpler than hash functions.

It is true that a user could change a directory in a way that wouldn't be discovered by the checksum. However, a user would have to change the file names like this on purpose, since normal changes to file names will (with high probability) give different checksums. Is it then necessary to guard against this "attack"?

One should always consider the consequences of each attack and choose the appropriate tools.
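
As a sketch of that, crc32 from Compress::Zlib (bundled with recent Perls) could replace md5_hex over the same sorted listing; the sub name and NUL separator are illustrative:

use strict;
use warnings;
use Compress::Zlib qw(crc32);

# Same change-detection idea as the MD5 variant, but with a plain
# CRC-32 over the sorted listing instead of a cryptographic hash.
sub dir_crc {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $crc = crc32(join "\0", sort readdir $dh);
    closedir $dh;
    return $crc;    # an unsigned 32-bit integer
}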

Martin Geisler
A: 

I did one of these in Python, if you're interested:

http://akiscode.com/articles/sha-1directoryhash.shtml

theangrybaby