If you are working on Linux/*nix systems, you can use SHA tools like sha512sum, now that MD5 is known to be broken:
find /path -type f -print0 | xargs -0 sha512sum | awk '{ h = $1; sub(/^[^ ]+ +/, ""); if (h in seen) print "duplicate: " $0 " and " seen[h]; else seen[h] = $0 }'

(The sub() strips the hash from each output line before saving it, so filenames containing spaces are reported correctly.)
If you want to do it in Python, here is a simple implementation:
import hashlib
import os

def sha(filename):
    '''Return the SHA-512 hex digest of a file, or None on error.'''
    d = hashlib.sha512()
    try:
        # read in binary mode, in chunks, so large files don't exhaust memory
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                d.update(chunk)
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

seen = {}
path = os.path.join("/home", "path1")
for root, dirs, files in os.walk(path):
    for name in files:
        filename = os.path.join(root, name)
        digest = sha(filename)
        if digest is None:
            continue
        if digest not in seen:
            seen[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, seen[digest]))
A SHA-512 collision is astronomically unlikely, but if you think that sha512sum alone is not enough, you can confirm suspected duplicates byte-by-byte with Unix tools like diff, or with filecmp in Python.
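For example, a minimal sketch of that confirmation step using filecmp (the two paths here are just placeholders; passing shallow=False makes filecmp.cmp compare actual file contents rather than just os.stat() signatures):

import filecmp

# shallow=False forces a byte-by-byte comparison instead of only
# comparing size/modification-time metadata
if filecmp.cmp("/home/path1/a.bin", "/home/path1/b.bin", shallow=False):
    print("confirmed duplicates")
else:
    print("contents differ despite matching hashes")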