I have two disks: one is an ad-hoc backup disk, which is a mess with duplicates everywhere, and the other is the disk in my laptop, which is an equal mess. I need to back up the unique files and delete the duplicates. So I need to do the following:

  • Find all non-zero size files
  • Calculate the MD5 digest of all files
  • Find files with duplicate file names
  • Separate unique files from master and other copies.

With the output of this script I will:

  • Backup the unique and master files
  • Delete the other copies

Unique file = no other copies

Master copy = first instance where other copies exist, preferably one matching the preferential path

Other copies = not master copies

I've created the appended script, which seems to make sense to me, but:

total files != unique files + master copies + other copies

I have two questions:

  1. Where's the error in my logic?
  2. Is there a more efficient way of doing this?

I chose disk-based hashes (DB_File) so that I don't run out of memory when processing enormous file lists.

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;

my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';

my $find = "$base/find.txt";
my $files = "$base/files.txt";

my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";

open (FIND, "< $find");
open (FILES, "> $files");

print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
  chomp($path);
  next if ($path =~ /^\s*$/);
  if (-f $path && -s $path) {
    print FILES "$path\n";
    $total_files++;
    printf "\r$total_files";
  }
}

close(FIND);
close(FILES);
open (FILES, "< $files");

sub compare {
  my ($key1, $key2) = @_;
  $key1 cmp $key2;
}

$DB_BTREE->{'compare'} = \&compare;

my %duplicate_count = ();

tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_duplicate_count_file: $!\n";

my %unique = ();

tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_unique_file: $!\n";

my %master_copy = ();

tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_master_copy_file: $!\n";

my %other_copy = ();

tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_other_copy_file: $!\n";

print "\nFinding duplicate filenames and calculating their MD5 digests\n";

my $file_counter = 0;
my $percent_complete = 0;

while (my $path = <FILES>) {

  $file_counter++;

  # remove trailing whitespace
  chomp($path);

  # extract filename from path
  my ($vol,$dir,$filename) = File::Spec->splitpath($path);

  # calculate the file's MD5 digest
  open(FILE, $path) or die "Can't open $path: $!";
  binmode(FILE);
  my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
  close(FILE);

  # filename not stored as duplicate
  if (!exists($duplicate_count{$filename})) {
    # assume unique
    $unique{$md5digest} = $path;
    # which implies 0 duplicates
    $duplicate_count{$filename} = 0;
  }
  # filename already found
  else {
    # delete unique record
    delete($unique{$md5digest});
    # second duplicate
    if ($duplicate_count{$filename}) {
      $duplicate_count{$filename}++;
    }
    # first duplicate
    else {
      $duplicate_count{$filename} = 1;
    }
    # the master copy is already assigned
    if (exists($master_copy{$md5digest})) {
      # the current path matches $path_pref, so becomes our new master copy
      if ($path =~ qq|^$path_pref|) {
        $master_copy{$md5digest} = $path;
      }
      else {
        # this one is a secondary copy
        $other_copy{$path} = $md5digest;
        # store with path as key, as there are duplicate digests
      }
    }
    # assume this is the master copy
    else {
      $master_copy{$md5digest} = $path;
    }
  }
  $percent_complete = int(($file_counter/$total_files)*100);
  printf("\rProgress: $percent_complete %%");
}

close(FILES);

# Write out data to text files for debugging

open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");

print "\n\nUnique files: ",scalar keys %unique,"\n";

foreach my $key (keys %unique) {
  print UNIQUE "$key\t", $unique{$key}, "\n";
  print UNIQUE_MD5 "$key\n";
}

close UNIQUE;
close UNIQUE_MD5;

open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");

print "Master copies: ",scalar keys %master_copy,"\n";

foreach my $key (keys %master_copy) {
  print MASTER "$key\t", $master_copy{$key}, "\n";
  print MASTER_MD5 "$key\n";
}

close MASTER;
close MASTER_MD5;

open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");

print "Other copies: ",scalar keys %other_copy,"\n";

foreach my $key (keys %other_copy) {
  print OTHER $other_copy{$key}, "\t$key\n";
  print OTHER_MD5 "$other_copy{$key}\n";
}

close OTHER;
close OTHER_MD5;

print "\n";

untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;

print "\n";
A: 

One apparent optimization is to use file size as an initial comparison basis, and only compute MD5 for files below a certain size or when two files collide with the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
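
For illustration, here is a minimal sketch of that idea. It reads a path list (like the question's find.txt) and keeps everything in ordinary in-memory hashes rather than DB_File, so it is not a drop-in replacement for the script above:

#!/usr/bin/perl
# Size-first sketch: bucket paths by file size and compute MD5 digests only
# for files whose size collides with at least one other file.
use strict;
use warnings;
use Digest::MD5;

my %paths_by_size;
while (my $path = <>) {                # list of paths on STDIN or in @ARGV
  chomp $path;
  next unless -f $path;
  my $size = -s _;                     # reuse the stat from the -f test
  next unless $size;                   # skip zero-byte files, as in the question
  push @{ $paths_by_size{$size} }, $path;
}

my %paths_by_digest;
for my $paths (values %paths_by_size) {
  next if @$paths == 1;                # unique size => unique file, no MD5 needed
  for my $path (@$paths) {
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    push @{ $paths_by_digest{$digest} }, $path;
  }
}

# Entries in %paths_by_digest with more than one path are duplicate candidates.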

You might also want to consider changing your approach for certain kinds of files that contain embedded metadata which can change without changing the underlying data, so you can find additional dupes even when the MD5s don't match. I'm speaking, of course, of MP3 or other music files whose metadata tags might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
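
As a narrow illustration of that idea, the sketch below computes a digest that ignores a trailing ID3v1 tag (the last 128 bytes of the file when they begin with "TAG"). ID3v2 headers and other tag formats are not handled, the whole file is slurped into memory for simplicity, and the audio_digest name is made up for this example:

use strict;
use warnings;
use Digest::MD5;

# Digest of an MP3 file with any trailing ID3v1 tag ignored.
sub audio_digest {
  my ($path) = @_;
  open my $fh, '<', $path or die "Can't open $path: $!";
  binmode $fh;

  my $length = -s $fh;
  if ($length > 128) {
    seek $fh, -128, 2;                 # 2 = SEEK_END
    read $fh, my $tail, 3;
    $length -= 128 if $tail eq 'TAG';  # drop the ID3v1 tag if present
  }

  seek $fh, 0, 0;                      # back to the start of the file
  read $fh, my $data, $length;
  return Digest::MD5::md5_hex($data);
}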

Jherico
+1  A: 

This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments):

open my $unique, '>', "$base/unique.txt"
  or die "Can't open $base/unique.txt for writing: $!";

If you don't want to explicitly ask each time, you could also check out the autodie module.
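
For example (a small illustration only, reusing the question's $base):

use autodie;    # open, close and friends now throw an exception on failure

open my $unique, '>', "$base/unique.txt";   # no explicit "or die" needed
print {$unique} "some record\n";
close $unique;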

Telemachus
Or be even more modern and go with IO::File
Todd Gardner
That one strikes me as a taste thing: I don't really want OO for opening files, but tastes vary. By 'modern' I really just meant Perl now supports lexical filehandles, so no need for barewords.
Telemachus
+2  A: 

Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":

if (!exists($duplicate_count{$filename})) {
   # assume unique
   $unique{$md5digest} = $path;
   # which implies 0 duplicates
   $duplicate_count{$filename} = 0;
}

The next time, you delete that unique record, without storing the path:

 # delete unique record
delete($unique{$md5digest});

So whatever file path was at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.

You'll need something like:

if (my $original_path = delete $unique{$md5digest}) {
    # Where should this one go?
}
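
For example, one rough way to fill that branch in, reusing the question's %master_copy, %other_copy and $path_pref (a sketch of the idea only, so the exact classification rules are assumptions):

if (my $original_path = delete $unique{$md5digest}) {
  # The first copy of this digest is no longer unique; classify it now.
  if (!exists $master_copy{$md5digest}) {
    $master_copy{$md5digest} = $original_path;
  }
  elsif ($original_path =~ /^\Q$path_pref\E/) {
    # Prefer paths under $path_pref as the master; demote the old master.
    $other_copy{ $master_copy{$md5digest} } = $md5digest;
    $master_copy{$md5digest} = $original_path;
  }
  else {
    $other_copy{$original_path} = $md5digest;
  }
}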

Also, as I mentioned in a comment above, IO::File would really clean up this code.

Todd Gardner
A: 

See this related question for a discussion of solutions in the abstract:

http://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer

IMPORTANT note: as much as we'd like to believe that 2 files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, then once you've broken it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.

Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations:

0 1

If your hash function told you 2 files both returned a "1", you would not assume they are the same file.

Given a hash of 2 bits, there are only 4 possible combinations:

 00 01 10 11

You would not assume 2 files returning the same value to be the same file.

Given a hash of 3 bits, there are only 8 possible combinations:

 000 001 010 011
 100 101 110 111

You would not assume 2 files returning the same value to be the same file.

This pattern goes on in ever-increasing amounts, to the point that people, for some bizarre reason, start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.

There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical you need to read every bit.
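
A sketch of such a check in Perl (the files_identical name is made up for this example; Perl's core File::Compare module does essentially the same job):

use strict;
use warnings;

# True only if the two files are byte-for-byte identical; stops reading at
# the first block that differs.
sub files_identical {
  my ($path_a, $path_b) = @_;
  return 0 unless -s $path_a == -s $path_b;       # different sizes: not identical

  open my $fh_a, '<', $path_a or die "Can't open $path_a: $!";
  open my $fh_b, '<', $path_b or die "Can't open $path_b: $!";
  binmode $_ for $fh_a, $fh_b;

  my ($buf_a, $buf_b);
  while (1) {
    my $read_a = read $fh_a, $buf_a, 65536;
    my $read_b = read $fh_b, $buf_b, 65536;
    die "Read error: $!" unless defined $read_a && defined $read_b;
    return 1 if $read_a == 0 && $read_b == 0;     # both at EOF, no difference found
    return 0 if $read_a != $read_b || $buf_a ne $buf_b;
  }
}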

Kent Fredric