I have two disks: one is an ad-hoc backup disk, which is a mess with duplicates everywhere, and the other is the disk in my laptop, which is an equal mess. I need to back up the unique files and delete the duplicates. So I need to do the following:

  • Find all non-zero size files
  • Calculate the MD5 digest of all files
  • Find files with duplicate file names
  • Separate unique files from master and other copies.

With the output of this script I will:

  • Backup the unique and master files
  • Delete the other copies

Unique file = no other copies

Master copy = first instance where other copies exist, preferably one matching the preferential path

Other copies = not master copies

I've created the appended script, which seems to make sense to me, but:

total files != unique files + master copies + other copies

I have two questions:

  1. Where's the error in my logic?
  2. Is there a more efficient way of doing this?

I chose disk-based hashes (DB_File) so that I don't run out of memory when processing enormous file lists.

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;

my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';

my $find = "$base/find.txt";
my $files = "$base/files.txt";

my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";

open (FIND, "< $find");
open (FILES, "> $files");

print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
  chomp($path);
  next if ($path =~ /^\s*$/);
  if (-f $path && -s $path) {
    print FILES "$path\n";
    $total_files++;
    printf "\r$total_files";
  }
}

close(FIND);
close(FILES);
open (FILES, "< $files");

sub compare {
  my ($key1, $key2) = @_;
  $key1 cmp $key2;
}

$DB_BTREE->{'compare'} = \&compare;

my %duplicate_count = ();

tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_duplicate_count_file: $!\n";

my %unique = ();

tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_unique_file: $!\n";

my %master_copy = ();

tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_master_copy_file: $!\n";

my %other_copy = ();

tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_other_copy_file: $!\n";

print "\nFinding duplicate filenames and calculating their MD5 digests\n";

my $file_counter = 0;
my $percent_complete = 0;

while (my $path = <FILES>) {

  $file_counter++;

  # remove trailing whitespace
  chomp($path);

  # extract filename from path
  my ($vol,$dir,$filename) = File::Spec->splitpath($path);

  # calculate the file's MD5 digest
  open(FILE, $path) or die "Can't open $path: $!";
  binmode(FILE);
  my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
  close(FILE);

  # filename not stored as duplicate
  if (!exists($duplicate_count{$filename})) {
    # assume unique
    $unique{$md5digest} = $path;
    # which implies 0 duplicates
    $duplicate_count{$filename} = 0;
  }
  # filename already found
  else {
    # delete unique record
    delete($unique{$md5digest});
    # second duplicate
    if ($duplicate_count{$filename}) {
      $duplicate_count{$filename}++;
    }
    # first duplicate
    else {
      $duplicate_count{$filename} = 1;
    }
    # the master copy is already assigned
    if (exists($master_copy{$md5digest})) {
      # the current path matches $path_pref, so becomes our new master copy
      if ($path =~ qq|^$path_pref|) {
        $master_copy{$md5digest} = $path;
      }
      else {
        # this one is a secondary copy
        $other_copy{$path} = $md5digest;
        # store with path as key, as there are duplicate digests
      }
    }
    # assume this is the master copy
    else {
      $master_copy{$md5digest} = $path;
    }
  }
  $percent_complete = int(($file_counter/$total_files)*100);
  printf("\rProgress: $percent_complete %%");
}

close(FILES);

# Write out data to text files for debugging

open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");

print "\n\nUnique files: ",scalar keys %unique,"\n";

foreach my $key (keys %unique) {
  print UNIQUE "$key\t", $unique{$key}, "\n";
  print UNIQUE_MD5 "$key\n";
}

close UNIQUE;
close UNIQUE_MD5;

open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");

print "Master copies: ",scalar keys %master_copy,"\n";

foreach my $key (keys %master_copy) {
  print MASTER "$key\t", $master_copy{$key}, "\n";
  print MASTER_MD5 "$key\n";
}

close MASTER;
close MASTER_MD5;

open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");

print "Other copies: ",scalar keys %other_copy,"\n";

foreach my $key (keys %other_copy) {
  print OTHER $other_copy{$key}, "\t$key\n";
  print OTHER_MD5 "$other_copy{$key}\n";
}

close OTHER;
close OTHER_MD5;

print "\n";

untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;

print "\n";
A: 

One apparent optimization is to use file size as an initial comparison basis, and only compute MD5 for files below a certain size or when two files collide with the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
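
For illustration, here is a minimal sketch of that idea. It reads a path list (like the question's find.txt) and keeps everything in ordinary in-memory hashes rather than DB_File, so it is not a drop-in replacement for the script above:

#!/usr/bin/perl
# Size-first sketch: bucket paths by file size and compute MD5 digests only
# for files whose size collides with at least one other file.
use strict;
use warnings;
use Digest::MD5;

my %paths_by_size;
while (my $path = <>) {                # list of paths on STDIN or in @ARGV
  chomp $path;
  next unless -f $path;
  my $size = -s _;                     # reuse the stat from the -f test
  next unless $size;                   # skip zero-byte files, as in the question
  push @{ $paths_by_size{$size} }, $path;
}

my %paths_by_digest;
for my $paths (values %paths_by_size) {
  next if @$paths == 1;                # unique size => unique file, no MD5 needed
  for my $path (@$paths) {
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    push @{ $paths_by_digest{$digest} }, $path;
  }
}

# Entries in %paths_by_digest with more than one path are duplicate candidates.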

You might also want to consider changing your approach for certain kinds of files that contain embedded metadata which can change without changing the underlying data, so you can find additional dupes even when the MD5s don't match. I'm speaking, of course, of MP3 or other music files whose metadata tags might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
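
As a narrow illustration of that idea, the sketch below computes a digest that ignores a trailing ID3v1 tag (the last 128 bytes of the file when they begin with "TAG"). ID3v2 headers and other tag formats are not handled, the whole file is slurped into memory for simplicity, and the audio_digest name is made up for this example:

use strict;
use warnings;
use Digest::MD5;

# Digest of an MP3 file with any trailing ID3v1 tag ignored.
sub audio_digest {
  my ($path) = @_;
  open my $fh, '<', $path or die "Can't open $path: $!";
  binmode $fh;

  my $length = -s $fh;
  if ($length > 128) {
    seek $fh, -128, 2;                 # 2 = SEEK_END
    read $fh, my $tail, 3;
    $length -= 128 if $tail eq 'TAG';  # drop the ID3v1 tag if present
  }

  seek $fh, 0, 0;                      # back to the start of the file
  read $fh, my $data, $length;
  return Digest::MD5::md5_hex($data);
}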

Jherico
+1  A: 

This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments):

open my $unique, '>', "$base/unique.txt"
  or die "Can't open $base/unique.txt for writing: $!";

If you don't want to explicitly ask each time, you could also check out the autodie module.
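
For example (a small illustration only, reusing the question's $base):

use autodie;    # open, close and friends now throw an exception on failure

open my $unique, '>', "$base/unique.txt";   # no explicit "or die" needed
print {$unique} "some record\n";
close $unique;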

Telemachus
Or be even more modern and go with IO::File
Todd Gardner
That one strikes me as a taste thing: I don't really want OO for opening files, but tastes vary. By 'modern' I really just meant Perl now supports lexical filehandles, so no need for barewords.
Telemachus
+2  A: 

Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":

if (!exists($duplicate_count{$filename})) {
   # assume unique
   $unique{$md5digest} = $path;
   # which implies 0 duplicates
   $duplicate_count{$filename} = 0;
}

The next time, you delete that unique record, without storing the path:

 # delete unique record
delete($unique{$md5digest});

So whatever file path was at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.

You'll need something like:

if (my $original_path = delete $unique{$md5digest}) {
    # Where should this one go?
}
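
For example, one rough way to fill that branch in, reusing the question's %master_copy, %other_copy and $path_pref (a sketch of the idea only, so the exact classification rules are assumptions):

if (my $original_path = delete $unique{$md5digest}) {
  # The first copy of this digest is no longer unique; classify it now.
  if (!exists $master_copy{$md5digest}) {
    $master_copy{$md5digest} = $original_path;
  }
  elsif ($original_path =~ /^\Q$path_pref\E/) {
    # Prefer paths under $path_pref as the master; demote the old master.
    $other_copy{ $master_copy{$md5digest} } = $md5digest;
    $master_copy{$md5digest} = $original_path;
  }
  else {
    $other_copy{$original_path} = $md5digest;
  }
}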

Also, as I mentioned in a comment above, IO::File would really clean up this code.

Todd Gardner
A: 

See this related question for a discussion of solutions in the abstract:

http://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer

IMPORTANT note: as much as we'd like to believe that 2 files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, then once you've broken it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.

Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations:

0 1

If your hash function told you 2 files both returned a "1", you would not assume they are the same file.

Given a hash of 2 bits, there are only 4 possible combinations:

 00 01 10 11

You would not assume 2 files returning the same value to be the same file.

Given a hash of 3 bits, there are only 8 possible combinations:

 000 001 010 011
 100 101 110 111

You would not assume 2 files returning the same value to be the same file.

This pattern goes on in ever-increasing amounts, to the point that people, for some bizarre reason, start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.

There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical you need to read every bit.
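
A sketch of such a check in Perl (the files_identical name is made up for this example; Perl's core File::Compare module does essentially the same job):

use strict;
use warnings;

# True only if the two files are byte-for-byte identical; stops reading at
# the first block that differs.
sub files_identical {
  my ($path_a, $path_b) = @_;
  return 0 unless -s $path_a == -s $path_b;       # different sizes: not identical

  open my $fh_a, '<', $path_a or die "Can't open $path_a: $!";
  open my $fh_b, '<', $path_b or die "Can't open $path_b: $!";
  binmode $_ for $fh_a, $fh_b;

  my ($buf_a, $buf_b);
  while (1) {
    my $read_a = read $fh_a, $buf_a, 65536;
    my $read_b = read $fh_b, $buf_b, 65536;
    die "Read error: $!" unless defined $read_a && defined $read_b;
    return 1 if $read_a == 0 && $read_b == 0;     # both at EOF, no difference found
    return 0 if $read_a != $read_b || $buf_a ne $buf_b;
  }
}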

Kent Fredric