views: 247

answers: 4
I have downloaded some files from the internet related to a particular topic, and I now wish to check whether any of them are duplicates. The issue is that the file names differ, but the content may match.

Is there any way to implement some code that will iterate through multiple folders and report which files are duplicates?

+3  A: 

You can traverse the folders recursively, compute the MD5 of each file, and then look for duplicate MD5 values; this will find files that are duplicates content-wise. Which language do you want to implement this in?

The following Perl program does this:

use strict;
use warnings;
use File::Find;
use Digest::MD5;

my @directories_to_search = ('a', 'e');
my %seen;    # digest => name of the first file seen with that content

find(\&wanted, @directories_to_search);

sub wanted {
        return unless -f $_;    # File::Find chdirs into each directory for us
        open my $fh, '<', $_ or die "Cannot open $File::Find::name: $!";
        binmode $fh;            # hash the raw bytes, not decoded text
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        if (exists $seen{$digest}) {
                print "Dup found: $File::Find::name and $seen{$digest}\n";
        } else {
                $seen{$digest} = $File::Find::name;
        }
}
codaddict
Language is not a constraint. I can use Perl or Python, and am even open to C++ or Java.
gagneet
@downvoter: care to explain?
codaddict
MD5 can still be used by the casual user... so I will upvote this one. However, you might want to try switching to a SHA algorithm.
ghostdog74
+3  A: 

Do a recursive search through all the files, sorting them by size; for any byte size shared by two or more files, do an MD5 or SHA-1 hash computation to see if they are in fact identical.

Regex will not help with this problem.

There are plenty of code examples on the net; I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)
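For reference, a minimal sketch of this size-then-hash approach in Python (the root path and chunk size are placeholders):

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    '''Group files by size, then hash only sizes that occur more than once.'''
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue                      # a unique size cannot be a duplicate
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:   # hash raw bytes in chunks
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)

    return [group for group in by_hash.values() if len(group) > 1]

for group in find_duplicates('/path/to/files'):
    print('Duplicates:', ' <==> '.join(group))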

benPearce
+1 for checking the size before using the hash.
codaddict
Checking the size first is useful in some cases, but counterproductive in others. For plain text, in particular, it's better to read through the file and ignore white space, so if (for example) somebody has converted line endings, it doesn't affect your comparison.
Jerry Coffin
Then you are taking context and file contents into the equation, thereby changing the definition of identical.
benPearce
+3  A: 

If you are working on Linux/*nix systems, you can use SHA tools like sha512sum, now that MD5 can be broken.

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]} (!($1 in seen)){seen[$1]=$2}'

If you want to work with Python, here is a simple implementation:

import hashlib
import os

def sha(filename):
    '''Return the SHA-512 hex digest of the file's contents.'''
    d = hashlib.sha512()
    try:
        with open(filename, 'rb') as f:                # read raw bytes, not text
            for chunk in iter(lambda: f.read(65536), b''):
                d.update(chunk)
    except OSError as e:
        print(e)
    else:
        return d.hexdigest()

seen = {}
path = os.path.join("/home", "path1")
for root, dirs, files in os.walk(path):
    for name in files:
        filename = os.path.join(root, name)
        digest = sha(filename)
        if digest is None:
            continue                                   # skip unreadable files
        if digest not in seen:
            seen[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, seen[digest]))

If you think that sha512sum is not enough, you can use Unix tools like diff, or filecmp in Python.
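For instance, filecmp can confirm a match byte-for-byte (a minimal sketch; the file names are placeholders):

import filecmp

# shallow=False compares actual contents instead of just os.stat() signatures
if filecmp.cmp('file_a', 'file_b', shallow=False):
    print('identical content')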

ghostdog74
+1  A: 

MD5 is a good way to find two identical files, but a matching hash is not sufficient to conclude that two files are identical! (In practice the risk is small, but it exists.) So you also need to compare the content.
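A minimal sketch of such a content comparison (assuming Python; the chunk size is arbitrary):

import os

def same_content(path_a, path_b, chunk_size=65536):
    '''Confirm a hash match by comparing the two files byte for byte.'''
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False                      # different sizes can never match
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:               # both files exhausted with no mismatch
                return True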

PS: Also, if you just want to compare text content, note that the line-ending character differs between Windows ('\r\n') and Linux ('\n').
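One way to ignore that difference (a sketch, assuming Python) is to read the text with universal newlines so that '\r\n' and '\n' compare equal:

def normalized_text(filename):
    # newline=None enables universal newlines: '\r\n' and '\r' become '\n'
    with open(filename, 'r', newline=None) as f:
        return f.read()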

EDIT:

Reference: two different files can have the same MD5 checksum (MD5 collision vulnerability, Wikipedia):

However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.

Phong
Why the downvote???
Phong
If the content is not identical, the MD5 will not be identical.
ghostdog74
That is not true; two different files CAN HAVE THE SAME MD5 HASH (the opposite is not true). This is called a collision vulnerability (the risk is small, but it exists). Ref: http://en.wikipedia.org/wiki/MD5#Collision_vulnerability
Phong
Another reference: http://en.wikipedia.org/wiki/Hash_collision
Phong
Upvote. Phong is right.
Daniel S
@Daniel: thanks
Phong
Although not remotely impossible, I will grant you that MD5 can be broken by the casual user. However, a stronger algorithm, such as SHA-512, should be used if the OP is paranoid about MD5 collisions.
ghostdog74
@ghostdog74: Whatever algorithm you are using, every hash has a chance of collisions; increasing the number of bits (or other techniques) will only reduce the risk of such a collision. Within the scope of this application, I agree with you that it is overthinking. But you couldn't rely on hash verification alone if this were a life-critical application.
Phong
@Phong: I changed my code to use SHA-512. Now the hash is long enough.
ghostdog74
@Phong: If you're worried about hash collisions for this question, you should be spending your life savings on lottery tickets, because there's a much better chance of you winning that.
Roger Pate