Hello, I have a script that we've been using for maintenance to clear up duplicate calendar items on our mail server. What we've found is that although it can remove the duplicate items, we need to ALSO remove the originating item.

The script is run as dups.pl . --killdups, and it then reports which items are dups of the original.

What I'm not sure how to do is tell the script to remove the original.

Since we display which file they are a dup of, it makes sense that we should be able to remove it at the same time. If anyone could help me modify this it would be greatly appreciated.

It is in this foreach loop that it finds the dups and then "unlinks" them:

foreach $l (@l) {
        @fields=split(/:--:/,$l,3);
            if($last[0] eq $fields[0] && -f "$dir/$fields[2]" && -f "$dir/$last[2]") {
            $dups++;
            print "$dir/$fields[2] is a dup of $dir/$last[2]\n";
            if($verbose==1) { print "    --- $fields[0]\n" }
            if($killdups==1) {
            print "Deleting $dir/$fields[2]\n";
                unlink "$dir/$fields[2]";
            }

The problem I have noticed is that if I choose to unlink "$dir/$last[2]" in this area too, then the script has a problem, as it relies on that original still existing in order to detect the remaining dups. Does anyone know of a quick way to modify this so that I can remove the dups and then remove the original at the very end?
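The "remove at the very end" idea can be sketched in isolation (this is a hypothetical illustration, not the actual script): record each original in a hash while the loop runs, and only unlink that set after the loop finishes, so the in-loop -f checks keep working.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: rather than unlinking $last[2] inside the loop
# (which breaks the -f checks for later comparisons), remember each
# original in a hash and delete the whole set after the loop is done.
sub plan_deletions {
    my (@records) = @_;              # each record: [ key, filename ]
    my (@dups, %originals);
    my @last = ('', '');
    for my $rec (@records) {
        my ($key, $file) = @$rec;
        if ($key ne '' && $key eq $last[0]) {
            push @dups, $file;               # unlink now in the real script
            $originals{ $last[1] } = 1;      # unlink this one at the very end
        } else {
            @last = ($key, $file);
        }
    }
    return ( \@dups, [ sort keys %originals ] );
}

my ($dups, $originals) = plan_deletions(
    [ k1 => 'a.eml' ], [ k1 => 'b.eml' ], [ k2 => 'c.eml' ],
);
print "dups: @$dups\n";              # dups: b.eml
print "originals: @$originals\n";    # originals: a.eml
```

In the real script, the `push @dups` line corresponds to the existing `unlink "$dir/$fields[2]"`, and a second loop over the hash keys after the main foreach does the final `unlink` of the originals.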

Here is the whole script in case you need it:

#!/usr/bin/perl

# Usage: dups.pl [--killdups][--verbose] <path to directory>

foreach $a (@ARGV) {
    if($a=~/^--/) {
        if ($a =~ /^--killdups/) { $killdups=1; }
        if($a =~ /^--verbose/) { $verbose=1; }
    } else { push (@dirs, $a) }
}


for $dir (@dirs) {
    if(!opendir(D, $dir)) {
    warn "$dir: $!";
    next;
    }

    $dir=~s/\/$//;

    @l=( );

    while ($f=readdir(D)) {
        $key="";
        if($f =~ /\.eml$/) {
            $key=readfile("$dir/$f");
        $mtime=(stat($f))[9];
        if($key ne "") {
                push(@l, $_=sprintf "%s:--:%d:--:%s", $key, $mtime, $f);
        } else {
        print "$dir/$f: Not a VCARD?\n";
        }
        }
    }
    closedir(D);

    @l=sort(@l);
    $dups=0;
    $last[0]=$last[1]=$last[2]="";
    foreach $l (@l) {
    @fields=split(/:--:/,$l,3);
        if($last[0] eq $fields[0] && -f "$dir/$fields[2]" && -f "$dir/$last[2]") {
        $dups++;
        print "$dir/$fields[2] is a dup of $dir/$last[2]\n";
        if($verbose==1) { print "    --- $fields[0]\n" }
        if($killdups==1) {
        print "Deleting $dir/$fields[2]\n";
            unlink "$dir/$fields[2]";
        }
    } elsif ($last[0] eq $fields[0]) {
        print "Strangeness -- $dir/$fields[2] dup of $dir/$last[2]??? -- [$fields[0]]\n";
        } else {
        if($verbose==1) {
            print "$dir/$fields[2] is UNIQUE\n";
            print "$fields[0]\n";
        }
            @last=@fields;
        }
    }
    if($killdups==1) {
    print "$dups duplicates removed.\n";
    } else {
    print "$dups duplicates detected.\n";
    }
}

sub readfile {
    local($f)=@_;
    local($k, $l, @l, $begin=0, $wrap, $xfa, $fn, $em, $start, $end, $sum, $org, $tel);

    $wrap=$org=$xfa=$fn=$em=$start=$end=$sum=$tel="";

    open(F, $f) || warn "$f: $!\n";
    @l=<F>;
    close F;
    foreach $l (@l) {
    if($l=~/^BEGIN:VTIMEZONE/) { $TZ=1 }
    elsif($begin==0 && $l=~/^Subject:\s*(.*)\s*$/) {
        $sum=$1; }
    elsif($begin==0 && $l=~/^BEGIN:VCARD/) { $begin=1; }
    elsif($begin==1 && $l=~/^END:VCARD/) { $begin=0; }
    elsif($l=~/^END:VTIMEZONE/) { $TZ=0 } # Ability to skip the timezone section
    elsif($TZ==0 && $begin==0 && $l=~/^BEGIN:VEVENT/) { $begin=1; }
    elsif($TZ==0 && $begin==1 && $l=~/^BEGIN:VEVENT/) { print STDERR "$f: WTF?\n" }
    if($begin==1) {
        if($start eq "" && $l=~/^DTSTART.*[\;\:]([\dT]+)/) {
            $start=$1;
            $start=~s/^\s+|\s+$//g;
            $start=~s/://g;
        } elsif($start eq "" && $l=~/^DTSTART.*[^\d](\d+T\d+)/) {
            $start=$1;
            $start=~s/^\s+|\s+$//g;
            $start=~s/://g;
        } elsif($end eq "" && $l=~/^DTEND.*[^\d](\d+T\d+)/) {
            $end=$1;
            $end=~s/^\s+|\s+$//g;
            $end=~s/://g;
        goto DTEND;
        } elsif($end eq "" && $l=~/^DTEND.*[\;\:]([\dT]+)/) {
            $end=$1;
            $end=~s/^\s+|\s+$//g;
            $end=~s/://g;
        goto DTEND;
        } elsif($org eq "" && $l=~/^ORG:(.*)$/) {
            $org=$1;
            $org=~s/^\s+|\s+$//g;
            $org=~s/://g;
        $wrap="org";
        } elsif($sum eq "" && $l=~/^SUMMARY:(.*)$/) {
            $sum=$1;
            $sum=~s/^\s+|\s+$//g;
            $sum=~s/://g;
        } elsif(($wrap eq "tel" && $l=~/^([A-Z]*\;.*)/) ||
        ($tel eq "" && $l=~/^(TEL\;.*)$/)) {
        $tel.=$1;
            $tel=~s/^\s+|\s+$//g;
            $tel=~s/^[\r\n]//g;
            $tel=~s/://g;
        $wrap="tel";
        } elsif(($wrap eq "org" && $l=~/^([A-Z]*\;.*)/) ||
        ($org eq "" && $l=~/^ORG:\s*(.*)\s*$/)) {
        $org.=$1;
            $org=~s/^\s+|\s+$//g;
            $org=~s/^[\r\n]//g;
            $org=~s/://g;
        $wrap="org";
        } elsif(($wrap eq "fn" && $l=~/^([A-Z]*\;.*)/) ||
        ($fn eq "" && $l=~/^FN:\s*(.*)\s*$/)) {
        $fn.=$1;
            $fn=~s/^\s+|\s+$//g;
            $fn=~s/^[\r\n]//g;
            $fn=~s/://g;
        $wrap="fn";
        } elsif(($wrap eq "em" && $l=~/^([A-Z]*\;.*)/) ||
        ($em eq "" && $l=~/^EMAIL[:;]\s*(.*)\s*$/)) {
        $em.=$1;
            $em=~s/^\s+|\s+$//g;
            $em=~s/^[\r\n]//g;
            $em=~s/://g;
        $wrap="em";
        } elsif(($wrap eq "xfa" && $l=~/^([A-Z]*\;.*)/) || 
        ($xfa eq "" && $l=~/^X-FILE-AS:\s*(.*)\s*$/)) {
        $xfa.=$1;
            $xfa=~s/^\s+|\s+$//g;
            $xfa=~s/^[\r\n]//g;
            $xfa=~s/://g;
        $wrap="xfa";
        } else {
        $wrap="";
        }
        }
    }
DTEND:
    if(($start eq "" || $end eq "") && ($fn eq "" && $em eq "" && $sum eq "" && $org eq "" && $tel eq "")) {
    if($verbose eq 1) {
        print "$f: \$start == [$start]\n";
        print "$f: \$end == [$end]\n";
        print "$f: \$sum == [$sum]\n";
        print "$f: \$fn == [$fn]\n";
        print "$f: \$em == [$em]\n";
        print "$f: \$org == [$org]\n";
        print "$f: \$tel == [$tel]\n";
    }
    return;
    }
    if($start ne "" || $end ne "") {
        $k=$start."-".$end."-".$sum;
    } else {
    $k=$xfa."-".$fn."-".$em."-".$org."-".$tel;
    }
    return $k;
}
+2  A: 

Seeing this code makes me happy I do not have to maintain it. There are a number of specific items you should address before anyone in his right mind should consider working on this:

Use strict and warnings.

Use Getopt::Long for command line arguments.

Declare variables in the smallest applicable scope instead of at the top of a subroutine.

Scope variables lexically using my and do not use local. For more information, see Coping with scoping.

Looking at:

    for $dir (@dirs) {
    if(!opendir(D, $dir)) {
    warn "$dir: $!";
    next;
    }

    $dir=~s/\/$//;

do you know which directory the last s/// is operating on?
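A minimal sketch of the fix (names here are hypothetical): strip the trailing slash on a lexical copy, and do it before opendir, so the open and the later "$dir/$file" concatenations agree on one cleaned name and the aliased element of @dirs is left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy first, then substitute, so the loop
# variable (an alias into @dirs) is never mutated.
sub clean_dir {
    my ($dir) = @_;
    (my $clean = $dir) =~ s{/\z}{};
    return $clean;
}

my @dirs = ('mail/Calendar/', 'mail/Inbox');
for my $dir (@dirs) {
    my $clean = clean_dir($dir);
    print "$clean\n";
    # opendir my $dh, $clean or do { warn "$clean: $!"; next };
}
```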

Similarly, if you pass multiple directories on the command line, the value in the package global handle D is ambiguous. The structure of the program should be:

use strict; use warnings;
use File::Spec::Functions qw( catfile );
use Getopt::Long;

my %opt = (
    verbose => 0,
    killdupes => 0,
);

GetOptions(\%opt, 'verbose', 'killdupes');

my %files;

for my $dir ( @ARGV ) {
    process_directory( \%files, $dir );
}

# do whatever you want with dupes in %files

use YAML;
print Dump \%files;

sub process_directory {
    my ($files, $dir) = @_;

    my $dir_h;

    unless ( opendir $dir_h, $dir ) {
        warn "Failed to open directory '$dir': $!\n";
        return;
    }

    while ( defined( my $file = readdir $dir_h ) ) {
        my $path = catfile $dir, $file;
        print "$path\n" if $opt{verbose};
        push @{ $files->{ keyof($file) } }, $path;
    }
}

sub keyof {
    return int(rand 2);
}

Finally, it looks like you are parsing/trying to parse Vcard files by hand. There are a bunch of Vcard related modules on CPAN.

Sinan Ünür
Thanks for such a quick reply. Unfortunately I didn't actually write this script; it was provided to us as a tool, but without the removal of the originating file it's worthless to us. I know it's pretty ugly, but I never thought it would take so much work to modify. It's hard to know which directory it's operating on, because the '.' in the command tells it to run in its current dir. Typically in our case it's been first_last/Calendar/#msgs where it is looking. I really figured we could remove the originating file after it's removed all dups for the item :(
Aaron
If you don't have the skills yourself, you should probably find someone who does. We can give you advice and help, but this isn't a free programming service.
brian d foy
thank you for being honest. I was never looking for a free programming service. I was hoping for advice that wasn't just "redo my entire script". thanks.
Aaron
+2  A: 

Here's a script I have that searches through a bunch of directories and deletes duplicate files. I mostly use it to get rid of duplicated digital photos. I go through all the files and note their MD5 digest. I keep a hash of all the files matching that digest. At the end, I display all the dupes then delete all but the first one that I found.

It's just a quick and dirty script, but the same process might work for you.

#!/usr/local/bin/perl
use strict;
use warnings;

use Digest::MD5;
use File::Spec::Functions;

my @dirs =  @ARGV;
print "Dirs are @dirs\n";

my %digests;
DIR: foreach my $dir ( @dirs )
    {
    opendir my $dh, $dir or do {
        warn "Skipping $dir: $!\n";
        next DIR;
        };

    my @files = 
        map { catfile( $dir, $_ ) }
        grep { ! /^\./ }
        readdir $dh;

    FILE: foreach my $file ( @files )
        {
        next if -d $file;
        my $digest = md5_digest( $file );

        push @{ $digests{ $digest } }, $file;
        }
    }

my $count = 0;
foreach my $digest ( keys %digests )
    {
    next unless @{ $digests{$digest} } > 1;

    local $" = "\n"; # "
    print "Digest: $digest\n@{ $digests{$digest} }\n------\n";

    $count++;

    # unlink everything but the first one
    unlink @{ $digests{$digest} }[ 1 .. $#{ $digests{$digest} } ];
    }

print "There were $count duplicated files\n";

sub md5_digest
    {
    my $file = shift;

    open my($fh), '<', $file or do {
        warn "cannot digest $file: $!";
        return;
        };

    my $ctx = Digest::MD5->new;

    $ctx->add( do { local $/; <$fh> } );

    return $ctx->hexdigest;
    }
brian d foy
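Adapting the digest-grouping idea above to Aaron's goal of deleting the original as well: once files are grouped by key, the slice can simply become the whole group, so every member of a duplicated group is unlinked. A self-contained sketch (the filenames and keys are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

# Build a throwaway directory: a.eml and b.eml share a key, c.eml is unique.
my $dir = tempdir( CLEANUP => 1 );
my %groups;
for my $name (qw(a.eml b.eml c.eml)) {
    my $path = catfile($dir, $name);
    open my $fh, '>', $path or die "$path: $!";
    print {$fh} "stub\n";
    close $fh;
    my $key = $name eq 'c.eml' ? 'unique' : 'same';
    push @{ $groups{$key} }, $path;
}

# When a key maps to more than one file, remove the *entire* group:
# the original goes along with its dups.
for my $key ( keys %groups ) {
    next unless @{ $groups{$key} } > 1;
    unlink @{ $groups{$key} } or warn "unlink failed: $!";
}

my $a_exists = -e catfile($dir, 'a.eml');
my $c_exists = -e catfile($dir, 'c.eml');
print $a_exists ? "a survives\n" : "a gone\n";    # a gone
print $c_exists ? "c survives\n" : "c gone\n";    # c survives
```

Unique items (groups with a single member) are skipped, so only items that actually had duplicates disappear.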