I have a folder with a large number of files, some of which have exactly the same contents. I want to remove files with duplicate contents, meaning that if two or more files with the same content are found, I'd like to keep one of them and delete the others.

Following is what I came up with, but I don't know whether it works :) since I haven't tried it yet.

How would you do it? In Perl, or as a general algorithm.

use strict;
use warnings;

my @files = glob("./files/*.txt");

my $current = 0;

while ( $current <= $#files ) {

    # read the contents of $files[$current] into a scalar
    my $contents1 = slurp( $files[$current] );

    my $compareTo = $current + 1;
    while ( $compareTo <= $#files ) {

        # read the contents of $files[$compareTo] into a scalar
        my $contents2 = slurp( $files[$compareTo] );

        if ( $contents1 eq $contents2 ) {
            # delete the duplicate and drop it from the list
            unlink $files[$compareTo]
                or warn "Could not delete $files[$compareTo]: $!";
            splice( @files, $compareTo, 1 );
        }
        else {
            $compareTo++;
        }
    }

    $current++;
}

# read a whole file into a single scalar
sub slurp {
    my ($name) = @_;
    open my $fh, '<', $name or die "Cannot open $name: $!";
    local $/;    # slurp mode
    return scalar <$fh>;
}
+7  A: 
md5sum *.txt | perl -ne '
   chomp; 
   ($sum, $file) = split(" "); 
   push @{$files{$sum}}, $file; 
   END {
      foreach (keys %files) { 
         shift @{$files{$_}}; 
         unlink @{$files{$_}} if @{$files{$_}};
      }
   }
'
Rudedog
That is exactly the reason why I've avoided shell scripting and Perl, and why I poke them with sticks. Does that actually do the intended task, or is it the equivalent of "cd / > rm -R *"? The world will never know! (Just being facetious. Though next time, consider commenting your script.)
Visionary Software Solutions
It does the intended task. The algorithm is to build a hash of lists: each hash key is an md5sum, and the elements of the list are the files that have that md5sum. Then you remove the first element of each list, and the remaining elements are candidates for deletion.
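Spelled out as a commented standalone script, the same logic looks roughly like this (it expects the output of md5sum *.txt on standard input):

use strict;
use warnings;

my %files;    # md5sum => list of files with that sum

while (<STDIN>) {
    chomp;
    my ($sum, $file) = split(" ");
    push @{ $files{$sum} }, $file;
}

for my $sum (keys %files) {
    shift @{ $files{$sum} };                          # keep the first file of each group
    unlink @{ $files{$sum} } if @{ $files{$sum} };    # delete the rest
}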
Rudedog
If you need comments for that script, close your terminal window now and never type in another program.
brian d foy
I think this could fail if the original file name has a space in it. To solve that, use `split " ", $_, 2` instead, where the 2 stops it splitting more than once (into two pieces).
Kinopiko
+1  A: 

Variations on a theme:

md5sum *.txt | perl -lne '
  my ($sum, $file) = split " ", $_, 2;
  unlink $file if $seen{$sum} ++;
'

No need to keep a list just to shift one element off and delete the rest; simply keep track of what you've seen before, and remove any file whose sum has already been seen. The 2-limit split is there to do the right thing with filenames containing spaces.

Also, if you don't trust this, just change the word unlink to print and it will output a list of files to be removed. You can even tee that output to a file, and then rm $(cat to-delete.txt) in the end if it looks good.
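For example, that dry-run variant would look something like this (to-delete.txt is just an arbitrary name):

md5sum *.txt | perl -lne '
  my ($sum, $file) = split " ", $_, 2;
  print $file if $seen{$sum} ++;
' | tee to-delete.txt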

hobbs
Fails on filenames with newlines, carriage returns, or a backslash (because md5sum oddly prints a backslash before the MD5). Although if you're doing that kind of mucking about in files, you probably won't need this question answering anyway...
rjp
A: 

You might want to have a look at what I did to find duplicate files and remove them, though you will have to modify it to your needs.

http://priyank.co.in/remove-duplicate-files

Priyank Bolia
That's an incredible amount of work to do what Ether did.
brian d foy
But that is much more extensible and supports much more as well; that sort of baggage comes with generalization. Also, the work is already done; most of the time you just need to download and run it.
Priyank Bolia
But the script is not yours, so why claim the credit here? Spam.
flamey
Who is claiming credit? The source has been mentioned, and I put it on my site because there are some slight modifications. Use a diff tool.
Priyank Bolia
+8  A: 

Here's a general algorithm (edited for efficiency now that I've shaken off the sleepies -- and I also fixed a bug that no one reported)... :)

It's going to take forever (not to mention a lot of memory) to compare every single file's contents against every other file's. Instead, let's compare their sizes first, and then compare checksums only for those files of identical size.

So when we calculate each file's size, we can use a hash table to do the matching for us, storing the matching files together in arrayrefs:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %files_by_size;
foreach my $file (@ARGV)
{
    push @{$files_by_size{-s $file}}, $file;   # store filename in the bucket for this file size (in bytes)
}

Now we just have to pull out the potential duplicates and check whether they really are the same (by computing a checksum for each, using Digest::MD5), using the same hashing technique:

while (my ($size, $files) = each %files_by_size)
{
    next if @$files == 1;

    my %files_by_md5;
    foreach my $file (@$files)
    {
        open my $filehandle, '<', $file or die "Can't open $file: $!";
        # enable slurp mode
        local $/;
        my $data = <$filehandle>;
        close $filehandle;

        my $md5 = md5_hex($data);
        push @{$files_by_md5{$md5}}, $file;       # store filename in the bucket for this MD5
    }

    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        print "These files are equal: " . join(", ", @$files) . "\n";
    }
}
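The loop above only reports the duplicates. If you actually want to delete them, one minimal variation (a sketch, not part of the answer as posted) is to replace that reporting loop with one that keeps the first file of each group and unlinks the rest:

    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        shift @$files;        # keep the first copy
        unlink @$files;       # delete the remaining duplicates
    }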

-fini

Ether
I'd stat the files for their sizes and only check the md5 sums if the sizes are identical.
Kinopiko
Good call, but it makes the organization of work harder -- you don't know that you need to md5sum file #3 until you find file #37 that has the same size :)
hobbs
Whether or not to do it depends on the size of the files. Typically if I am looking for duplicates it is in large image files, where the md5 bit will be very slow. For text files like program files it's unlikely to be a big problem so the simplistic code is OK.
Kinopiko
I've edited the code to check for filesize first (and fixed a bug that no one spotted) :)
Ether
Ether, thank you for your solution! There still are a couple of issues with this code: @$files{$md5} should be @$files, and @$files_by_size{-s $file} and @$files_by_md5{$md5} need extra curlies - @{$files_by_size{-s $file}} - otherwise it doesn't work, at least in Perl v5.10.1 on Win32. The logic looks right, though.
flamey
@flamey: fixed :)
Ether
+3  A: 

Perl, with the Digest::MD5 module.

use Digest::MD5 ;
%seen = ();
while( <*> ){
    -d and next;
    $filename="$_"; 
    print "doing .. $filename\n";
    $md5 = getmd5($filename);
    if ( ! defined( $seen{$md5} ) ){
        $seen{$md5}="$filename";
    }else{
        print "Duplicate: $filename and $seen{$md5}\n";
    }
}
sub getmd5 {
    my ($file) = @_;
    open(FH,"<",$file) or die "Cannot open file: $!\n";
    binmode(FH);
    my $md5 = Digest::MD5->new;
    $md5->addfile(FH);
    close(FH);
    return $md5->hexdigest;
}

If Perl is not a must and you are working on *nix, you can use shell tools:

find /path -type f -print0 | xargs -0 md5sum | awk '($1 in seen){ print "duplicate: "$2" and "seen[$1] }
( ! ($1 in  seen ) ) { seen[$1]=$2 }'
ghostdog74
So far I like this solution the best, thank you! I have one question, though: why is $filename in quotes on the line $seen{$md5}="$filename"; ? Also, it seems that start is missing before FH in $md5->addfile(FH); -- addfile(*FH)
flamey
It's just a habit. I don't understand your second part about "start is missing..."
ghostdog74
Typo, sorry. I meant star. In strict mode that line fails; it must be *FH. This is in all the examples for Digest::MD5 on CPAN as well.
flamey
Yes, there should be *FH, but it works without it as well.
ghostdog74
A: 

I'd recommend that you do it in Perl, and use File::Find while you're at it.
Who knows what you're doing to generate your list of files, but you might want to combine it with your duplicate checking.

perl -MFile::Find -MDigest::MD5 -e '
my %m;
find(sub{
  if(-f&&-r){
   open(F,"<",$_);
   binmode F;
   $d=Digest::MD5->new->addfile(F);
   if(exists($m{$d->hexdigest})){
     $m{$d->hexdigest}[5]++;
     push @{$m{$d->hexdigest}[0]}, $File::Find::name;
   }else{
     $m{$d->hexdigest} = [[$File::Find::name],0,0,0,0,1];
   }
   close F
 }},".");
 foreach $d (keys %m) {
   if ($m{$d}[5] > 1) {
     print "Probable duplicates: ".join(" , ",$m{$d}[0])."\n\n";
   }
 }'
dlamblin
Nobody said anything about files in more than one directory, so File::Find isn't likely to be at all useful.
ysth
Quite astute, nobody indeed
dlamblin
+1  A: 

Perl is kinda overkill for this:

md5sum * | sort | uniq -w 32 -D | cut -b 35- | tr '\n' '\0' | xargs -0 rm

(If you are missing some of these utilities or they don't have these flags/functions, install GNU findutils and coreutils.)

ysth
Stick a tr '\n' '\0' before the xargs and use the -0 flag on xargs to avoid problem characters in the filenames. ... | cut -b 35- | tr '\n' '\0' | xargs -0 rm
rjp
@rjp: thanks, done
ysth
@rjp: though I don't know of any way to deal with \n in filenames...I wish all the coreutils took -0.
ysth
You really have to toss about, I think, something like this (and pray none of the duplicate files have ^A, [::RET::] or [::NL::] in the filename...): find . -type f -print0 | xargs -0 md5sum | sed -e 's/\\n/[::NL::]/g' -e 's/^M/[::RET::]/g' -e 's/^\\//' | sort | uniq -w 32 -D | cut -b 35- | tr '\n' '\0' | sed -e 's/\[::RET::\]/^M/g' -e 's/\[::NL::\]/^A/g' | tr '\001' '\n' | xargs -0 -n 1 rm
rjp
A: 

A bash script is more expressive than Perl in this case:

md5sum * |sort -k1|uniq -w32 -d|cut -f2 -d' '|xargs rm
catwalk
Won't that break on filenames with spaces or other funny characters?
rjp
uniq -D, not -d. -d only outputs one line for each duplicate, so if three files had the same contents, only one would be deleted
ysth
@ysth: yes, but -D results in deleting all the files, whereas we want at least one to be left; I guess the easy fix would be wrapping the whole thing in a while loop: while md5sum ...| xargs rm; do :; done
catwalk
@catwalk: -D works for me: `perl -wle'print for 1,1..3'|uniq -D` correctly only prints the two 1's. Are you seeing something different?
ysth
@ysth: "I'd like to leave one of these files" so "1,1,2,3" should only print one 1, leaving one 1 behind.
rjp
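One way to reconcile the two requirements discussed here (delete the duplicates but keep one copy of each group) without a shell loop is to let awk skip the first file of every checksum group, in the spirit of ghostdog74's one-liner above. A sketch, still unsafe for filenames containing whitespace:

md5sum * | awk 'seen[$1]++ { print substr($0, 35) }' | xargs rm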
A: 

Here is a way of filtering by size first and by md5 checksum second:

#!/usr/bin/perl

use strict; use warnings;

use Digest::MD5 qw( md5_hex );
use File::Slurp;
use File::Spec::Functions qw( catfile rel2abs );
use Getopt::Std;

my %opts;

getopt('de', \%opts);
$opts{d} = '.' unless defined $opts{d};
$opts{d} = rel2abs $opts{d};

warn sprintf "Checking %s\n", $opts{d};

my $files = get_same_size_files( \%opts );

$files = get_same_md5_files( $files );

for my $size ( keys %$files ) {
    for my $digest ( keys %{ $files->{$size}} ) {
        print "$digest ($size)\n";
        print "$_\n" for @{ $files->{$size}->{$digest} };
        print "\n";
    }
}

sub get_same_md5_files {
    my ($files) = @_;

    my %out;

    for my $size ( keys %$files ) {
        my %md5;
        for my $file ( @{ $files->{$size}} ) {
            my $contents = read_file $file, {binmode => ':raw'};
            push @{ $md5{ md5_hex($contents) } }, $file;
        }
        for my $k ( keys %md5 ) {
            delete $md5{$k} unless @{ $md5{$k} } > 1;
        }
        $out{$size} = \%md5 if keys %md5;
    }
    return \%out;
}

sub get_same_size_files {
    my ($opts) = @_;

    my $checker = defined($opts->{e})
                ? sub { scalar ($_[0] =~ /\.$opts->{e}\z/) }
                : sub { 1 };

    my %sizes;
    my @files = grep { $checker->($_) } read_dir $opts->{d};

    for my $file ( @files ) {
        my $path = catfile $opts->{d}, $file;
        next unless -f $path;

        my $size = (stat $path)[7];
        push @{ $sizes{$size} }, $path;
    }

    for my $k (keys %sizes) {
        delete $sizes{$k} unless @{ $sizes{$k} } > 1;
    }

    return \%sizes;
}
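
A usage sketch (the script name dupes.pl is just a placeholder; note that the script reports groups of identical files rather than deleting them):

perl dupes.pl -d ./files -e txt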
Sinan Ünür