Hello, I have a script that we've been using for maintenance to clear up duplicate calendar items on our mail server. What we've found is that although it can remove the duplicate items, we need to ALSO remove the originating item.

The script is run as dups.pl . --killdups, and it then reports which items are dups of the original.

What I'm not sure how to do is tell the script to remove the original.

Since we display which file they are a dup of, it makes sense that we should be able to remove it at the same time. If anyone could help me modify this it would be greatly appreciated.

It is in this foreach loop that it finds the dups and then "unlinks" them:

foreach $l (@l) {
        @fields=split(/:--:/,$l,3);
            if($last[0] eq $fields[0] && -f "$dir/$fields[2]" && -f "$dir/$last[2]") {
            $dups++;
            print "$dir/$fields[2] is a dup of $dir/$last[2]\n";
            if($verbose==1) { print "    --- $fields[0]\n" }
            if($killdups==1) {
            print "Deleting $dir/$fields[2]\n";
                unlink "$dir/$fields[2]";
            }

The problem I have noticed is that if I choose to unlink "$dir/$last[2]" in this area too, then the script has a problem, as it relies on that original still existing in order to detect the remaining dups. Does anyone know of a quick way to modify this so that I can remove the dups and then remove the original at the very end?
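The "remove at the very end" idea can be sketched in isolation (this is a hypothetical illustration, not the actual script): record each original in a hash while the loop runs, and only unlink that set after the loop finishes, so the in-loop -f checks keep working.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: rather than unlinking $last[2] inside the loop
# (which breaks the -f checks for later comparisons), remember each
# original in a hash and delete the whole set after the loop is done.
sub plan_deletions {
    my (@records) = @_;              # each record: [ key, filename ]
    my (@dups, %originals);
    my @last = ('', '');
    for my $rec (@records) {
        my ($key, $file) = @$rec;
        if ($key ne '' && $key eq $last[0]) {
            push @dups, $file;               # unlink now in the real script
            $originals{ $last[1] } = 1;      # unlink this one at the very end
        } else {
            @last = ($key, $file);
        }
    }
    return ( \@dups, [ sort keys %originals ] );
}

my ($dups, $originals) = plan_deletions(
    [ k1 => 'a.eml' ], [ k1 => 'b.eml' ], [ k2 => 'c.eml' ],
);
print "dups: @$dups\n";              # dups: b.eml
print "originals: @$originals\n";    # originals: a.eml
```

In the real script, the `push @dups` line corresponds to the existing `unlink "$dir/$fields[2]"`, and a second loop over the hash keys after the main foreach does the final `unlink` of the originals.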

Here is the whole script in case you need it:

#!/usr/bin/perl

# Usage: dups.pl [--killdups][--verbose] <path to directory>

foreach $a (@ARGV) {
    if($a=~/^--/) {
        if ($a =~ /^--killdups/) { $killdups=1; }
        if($a =~ /^--verbose/) { $verbose=1; }
    } else { push (@dirs, $a) }
}


for $dir (@dirs) {
    if(!opendir(D, $dir)) {
    warn "$dir: $!";
    next;
    }

    $dir=~s/\/$//;

    @l=( );

    while ($f=readdir(D)) {
        $key="";
        if($f =~ /\.eml$/) {
            $key=readfile("$dir/$f");
        $mtime=(stat($f))[9];
        if($key ne "") {
                push(@l, $_=sprintf "%s:--:%d:--:%s", $key, $mtime, $f);
        } else {
        print "$dir/$f: Not a VCARD?\n";
        }
        }
    }
    closedir(D);

    @l=sort(@l);
    $dups=0;
    $last[0]=$last[1]=$last[2]="";
    foreach $l (@l) {
    @fields=split(/:--:/,$l,3);
        if($last[0] eq $fields[0] && -f "$dir/$fields[2]" && -f "$dir/$last[2]") {
        $dups++;
        print "$dir/$fields[2] is a dup of $dir/$last[2]\n";
        if($verbose==1) { print "    --- $fields[0]\n" }
        if($killdups==1) {
        print "Deleting $dir/$fields[2]\n";
            unlink "$dir/$fields[2]";
        }
    } elsif ($last[0] eq $fields[0]) {
        print "Strangeness -- $dir/$fields[2] dup of $dir/$last[2]??? -- [$fields[0]]\n";
        } else {
        if($verbose==1) {
            print "$dir/$fields[2] is UNIQUE\n";
            print "$fields[0]\n";
        }
            @last=@fields;
        }
    }
    if($killdups==1) {
    print "$dups duplicates removed.\n";
    } else {
    print "$dups duplicates detected.\n";
    }
}

sub readfile {
    local($f)=@_;
    local($k, $l, @l, $begin=0, $wrap, $xfa, $fn, $em, $start, $end, $sum, $org, $tel);

    $wrap=$org=$xfa=$fn=$em=$start=$end=$sum=$tel="";

    open(F, $f) || warn "$f: $!\n";
    @l=<F>;
    close F;
    foreach $l (@l) {
    if($l=~/^BEGIN:VTIMEZONE/) { $TZ=1 }
    elsif($begin==0 && $l=~/^Subject:\s*(.*)\s*$/) {
        $sum=$1; }
    elsif($begin==0 && $l=~/^BEGIN:VCARD/) { $begin=1; }
    elsif($begin==1 && $l=~/^END:VCARD/) { $begin=0; }
    elsif($l=~/^END:VTIMEZONE/) { $TZ=0 } # Ability to skip the timezone section
    elsif($TZ==0 && $begin==0 && $l=~/^BEGIN:VEVENT/) { $begin=1; }
    elsif($TZ==0 && $begin==1 && $l=~/^BEGIN:VEVENT/) { print STDERR "$f: WTF?\n" }
    if($begin==1) {
        if($start eq "" && $l=~/^DTSTART.*[\;\:]([\dT]+)/) {
            $start=$1;
            $start=~s/^\s+|\s+$//g;
            $start=~s/://g;
        } elsif($start eq "" && $l=~/^DTSTART.*[^\d](\d+T\d+)/) {
            $start=$1;
            $start=~s/^\s+|\s+$//g;
            $start=~s/://g;
        } elsif($end eq "" && $l=~/^DTEND.*[^\d](\d+T\d+)/) {
            $end=$1;
            $end=~s/^\s+|\s+$//g;
            $end=~s/://g;
        goto DTEND;
        } elsif($end eq "" && $l=~/^DTEND.*[\;\:]([\dT]+)/) {
            $end=$1;
            $end=~s/^\s+|\s+$//g;
            $end=~s/://g;
        goto DTEND;
        } elsif($org eq "" && $l=~/^ORG:(.*)$/) {
            $org=$1;
            $org=~s/^\s+|\s+$//g;
            $org=~s/://g;
        $wrap="org";
        } elsif($sum eq "" && $l=~/^SUMMARY:(.*)$/) {
            $sum=$1;
            $sum=~s/^\s+|\s+$//g;
            $sum=~s/://g;
        } elsif(($wrap eq "tel" && $l=~/^([A-Z]*\;.*)/) ||
        ($tel eq "" && $l=~/^(TEL\;.*)$/)) {
        $tel.=$1;
            $tel=~s/^\s+|\s+$//g;
            $tel=~s/^[\r\n]//g;
            $tel=~s/://g;
        $wrap="tel";
        } elsif(($wrap eq "org" && $l=~/^([A-Z]*\;.*)/) ||
        ($org eq "" && $l=~/^ORG:\s*(.*)\s*$/)) {
        $org.=$1;
            $org=~s/^\s+|\s+$//g;
            $org=~s/^[\r\n]//g;
            $org=~s/://g;
        $wrap="org";
        } elsif(($wrap eq "fn" && $l=~/^([A-Z]*\;.*)/) ||
        ($fn eq "" && $l=~/^FN:\s*(.*)\s*$/)) {
        $fn.=$1;
            $fn=~s/^\s+|\s+$//g;
            $fn=~s/^[\r\n]//g;
            $fn=~s/://g;
        $wrap="fn";
        } elsif(($wrap eq "em" && $l=~/^([A-Z]*\;.*)/) ||
        ($em eq "" && $l=~/^EMAIL[:;]\s*(.*)\s*$/)) {
        $em.=$1;
            $em=~s/^\s+|\s+$//g;
            $em=~s/^[\r\n]//g;
            $em=~s/://g;
        $wrap="em";
        } elsif(($wrap eq "xfa" && $l=~/^([A-Z]*\;.*)/) || 
        ($xfa eq "" && $l=~/^X-FILE-AS:\s*(.*)\s*$/)) {
        $xfa.=$1;
            $xfa=~s/^\s+|\s+$//g;
            $xfa=~s/^[\r\n]//g;
            $xfa=~s/://g;
        $wrap="xfa";
        } else {
        $wrap="";
        }
        }
    }
DTEND:
    if(($start eq "" || $end eq "") && ($fn eq "" && $em eq "" && $sum eq "" && $org eq "" && $tel eq "")) {
    if($verbose eq 1) {
        print "$f: \$start == [$start]\n";
        print "$f: \$end == [$end]\n";
        print "$f: \$sum == [$sum]\n";
        print "$f: \$fn == [$fn]\n";
        print "$f: \$em == [$em]\n";
        print "$f: \$org == [$org]\n";
        print "$f: \$tel == [$tel]\n";
    }
    return;
    }
    if($start ne "" || $end ne "") {
        $k=$start."-".$end."-".$sum;
    } else {
    $k=$xfa."-".$fn."-".$em."-".$org."-".$tel;
    }
    return $k;
}
+2  A: 

Seeing this code makes me happy I do not have to maintain it. There are a number of specific items you should address before anyone in his right mind should consider working on this:

Use strict and warnings.

Use Getopt::Long for command line arguments.

Declare variables in the smallest applicable scope instead of at the top of a subroutine.

Scope variables lexically using my and do not use local. For more information, see Coping with scoping.

Looking at:

    for $dir (@dirs) {
    if(!opendir(D, $dir)) {
    warn "$dir: $!";
    next;
    }

    $dir=~s/\/$//;

do you know which directory the last s/// is operating on?
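A minimal sketch of the fix (names here are hypothetical): strip the trailing slash on a lexical copy, and do it before opendir, so the open and the later "$dir/$file" concatenations agree on one cleaned name and the aliased element of @dirs is left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy first, then substitute, so the loop
# variable (an alias into @dirs) is never mutated.
sub clean_dir {
    my ($dir) = @_;
    (my $clean = $dir) =~ s{/\z}{};
    return $clean;
}

my @dirs = ('mail/Calendar/', 'mail/Inbox');
for my $dir (@dirs) {
    my $clean = clean_dir($dir);
    print "$clean\n";
    # opendir my $dh, $clean or do { warn "$clean: $!"; next };
}
```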

Similarly, if you pass multiple directories on the command line, the value in the package global handle D is ambiguous. The structure of the program should be:

use strict; use warnings;
use File::Spec::Functions qw( catfile );
use Getopt::Long;

my %opt = (
    verbose => 0,
    killdupes => 0,
);

GetOptions(\%opt, 'verbose', 'killdupes');

my %files;

for my $dir ( @ARGV ) {
    process_directory( \%files, $dir );
}

# do whatever you want with dupes in %files

use YAML;
print Dump \%files;

sub process_directory {
    my ($files, $dir) = @_;

    my $dir_h;

    unless ( opendir $dir_h, $dir ) {
        warn "Failed to open directory '$dir': $!\n";
        return;
    }

    while ( defined( my $file = readdir $dir_h ) ) {
        my $path = catfile $dir, $file;
        print "$path\n" if $opt{verbose};
        push @{ $files->{ keyof($file) } }, $path;
    }
}

sub keyof {
    return int(rand 2);
}

Finally, it looks like you are parsing/trying to parse Vcard files by hand. There are a bunch of Vcard related modules on CPAN.

Sinan Ünür
Thanks for such a quick reply. Unfortunately I didn't actually write this script; it was provided to us as a tool, but without the removal of the originating file it's worthless to us. I know it's pretty ugly, but I never thought it would take so much work to modify. It's hard to know which directory it's operating on, because the '.' in the command tells it to run in its current dir. Typically in our case it's been first_last/Calendar/#msgs where it is looking. I really figured we could remove the originating file after it's removed all dups for the item :(
Aaron
If you don't have the skills yourself, you should probably find someone who does. We can give you advice and help, but this isn't a free programming service.
brian d foy
thank you for being honest. I was never looking for a free programming service. I was hoping for advice that wasn't just "redo my entire script". thanks.
Aaron
+2  A: 

Here's a script I have that searches through a bunch of directories and deletes duplicate files. I mostly use it to get rid of duplicated digital photos. I go through all the files and note their MD5 digest. I keep a hash of all the files matching that digest. At the end, I display all the dupes then delete all but the first one that I found.

It's just a quick and dirty script, but the same process might work for you.

#!/usr/local/bin/perl
use strict;
use warnings;

use Digest::MD5;
use File::Spec::Functions;

my @dirs =  @ARGV;
print "Dirs are @dirs\n";

my %digests;
DIR: foreach my $dir ( @dirs )
    {
    opendir my $dh, $dir or do {
        warn "Skipping $dir: $!\n";
        next DIR;
        };

    my @files = 
        map { catfile( $dir, $_ ) }
        grep { ! /^\./ }
        readdir $dh;

    FILE: foreach my $file ( @files )
        {
        next if -d $file;
        my $digest = md5_digest( $file );

        push @{ $digests{ $digest } }, $file;
        }
    }

my $count = 0;
foreach my $digest ( keys %digests )
    {
    next unless @{ $digests{$digest} } > 1;

    local $" = "\n"; # "
    print "Digest: $digest\n@{ $digests{$digest} }\n------\n";

    $count++;

    # unlink everything but the first one
    unlink @{ $digests{$digest} }[ 1 .. $#{ $digests{$digest} } ];
    }

print "There were $count duplicated files\n";

sub md5_digest
    {
    my $file = shift;

    open my($fh), '<', $file or do {
        warn "cannot digest $file: $!";
        return;
        };

    my $ctx = Digest::MD5->new;

    $ctx->add( do { local $/; <$fh> } );

    return $ctx->hexdigest;
    }
brian d foy
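Adapting the digest-grouping idea above to Aaron's goal of deleting the original as well: once files are grouped by key, the slice can simply become the whole group, so every member of a duplicated group is unlinked. A self-contained sketch (the filenames and keys are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

# Build a throwaway directory: a.eml and b.eml share a key, c.eml is unique.
my $dir = tempdir( CLEANUP => 1 );
my %groups;
for my $name (qw(a.eml b.eml c.eml)) {
    my $path = catfile($dir, $name);
    open my $fh, '>', $path or die "$path: $!";
    print {$fh} "stub\n";
    close $fh;
    my $key = $name eq 'c.eml' ? 'unique' : 'same';
    push @{ $groups{$key} }, $path;
}

# When a key maps to more than one file, remove the *entire* group:
# the original goes along with its dups.
for my $key ( keys %groups ) {
    next unless @{ $groups{$key} } > 1;
    unlink @{ $groups{$key} } or warn "unlink failed: $!";
}

my $a_exists = -e catfile($dir, 'a.eml');
my $c_exists = -e catfile($dir, 'c.eml');
print $a_exists ? "a survives\n" : "a gone\n";    # a gone
print $c_exists ? "c survives\n" : "c gone\n";    # c survives
```

Unique items (groups with a single member) are skipped, so only items that actually had duplicates disappear.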