views: 1041

answers: 3

Hello there to all. Well, I am not a programming guru in Python, but I manage to write some scripts :) The one I am writing finds and removes duplicate files from a folder. I have multiple copies of mp3 files, and the same is the case with other files. So if anyone can help solve this problem, I will be thankful. I am using the SHA-1 algorithm.

Thanking you all in advance :)

+5  A: 

I wrote one in Python some time ago -- you're welcome to use it.

import sys
import os
import hashlib

check_path = (lambda filepath, hashes, p = sys.stdout.write:
        (lambda hash = hashlib.sha1 (file (filepath).read ()).hexdigest ():
                ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                          '   %s\n'
                                          'of %s\n' % (filepath, hashes[hash])))
                 or hashes.setdefault (hash, filepath)))())

scan = (lambda dirpath, hashes = {}: 
                map (lambda (root, dirs, files):
                        map (lambda filename: check_path (os.path.join (root, filename), hashes), files), os.walk (dirpath)))

((len (sys.argv) > 1) and scan (sys.argv[1]))
John Millikin
+1 Ok, that's pretty neat.
altCognito
I can't follow what's happening there. If you get a chance, could you maybe explain a little of what's going on?
tgray
-1: unreadable code
nosklo
-1: Lambdas (ToT)
S.Lott
+0: looks like lisp
cobbal
I wonder if the author is furious for being downvoted :)
Anonymous
+1: looks like lisp and answers the question.
Joe
Anon -- I posted this just to be silly. It's a rather difficult-to-read solution to an easy problem. The original question could probably be solved in a few lines of idiomatic Python, if the original poster bothered to spend a few minutes of quality time with Google.
John Millikin
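For reference, the "few lines of idiomatic Python" alluded to above might look something like this minimal sketch (hypothetical script; it reads each file fully into memory and only reports duplicates rather than deleting them):

import os
import sys
import hashlib

def find_duplicates(top):
    """Walk `top` and report files whose SHA-1 digest was already seen."""
    seen = {}  # digest -> first path that produced it
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha1(open(path, 'rb').read()).hexdigest()
            if digest in seen:
                print('%s duplicates %s' % (path, seen[digest]))
            else:
                seen[digest] = path

if __name__ == '__main__':
    find_duplicates(sys.argv[1])
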
+5  A: 
import os
import hashlib

def remove_duplicates(dir):
   unique = []
   for filename in os.listdir(dir):
      path = os.path.join(dir, filename)   # os.listdir returns bare names
      if os.path.isfile(path):
         filehash = hashlib.md5(open(path, 'rb').read()).hexdigest()
         if filehash not in unique: unique.append(filehash)
         else: os.remove(path)

//edit:

For mp3 files you may also be interested in this topic: http://stackoverflow.com/questions/476227/detect-duplicate-mp3-files-with-different-bitrates-and-or-different-id3-tags
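(Not from that thread, just a hypothetical illustration of the simplest idea in that direction: hash only the audio bytes, skipping a leading ID3v2 tag and a trailing ID3v1 tag. This handles differing tags only; differing bitrates would need actual audio fingerprinting.)

import hashlib

def audio_digest(path):
    """SHA-1 of an mp3's bytes with ID3v2 (leading) and ID3v1 (trailing) tags stripped."""
    data = open(path, 'rb').read()
    if data[:3] == b'ID3' and len(data) > 10:
        # The ID3v2 size field is a 4-byte "syncsafe" integer: 7 bits per byte.
        size = 0
        for byte in bytearray(data[6:10]):
            size = (size << 7) | (byte & 0x7f)
        data = data[10 + size:]
    if data[-128:][:3] == b'TAG':
        data = data[:-128]  # drop the trailing ID3v1 tag
    return hashlib.sha1(data).hexdigest()
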

zalew
of course you can replace the md5 hashing with sha1
zalew
For performance, you should probably change unique to be a set (though it probably won't be a big factor unless there are lots of small files). Also, your code will fail if there is a directory in the dir. Check os.path.isfile() before you process them.
Brian
Yep, this code is more of a starting point. I added isfile as you suggested.
zalew
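For completeness, here is a sketch of the same loop with Brian's remaining suggestion applied, using a set for the seen hashes (and with MD5 swapped for SHA-1, as zalew notes is possible); the behaviour is otherwise the same:

import os
import hashlib

def remove_duplicates(dir):
    seen = set()  # hashes of the files kept so far; set membership is O(1)
    for filename in os.listdir(dir):
        path = os.path.join(dir, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        filehash = hashlib.sha1(open(path, 'rb').read()).hexdigest()
        if filehash in seen:
            os.remove(path)  # an identical file was already kept
        else:
            seen.add(filehash)
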
+5  A: 

Recursive folders version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan them all recursively and report all duplicates found.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                # Hash the file in chunks so large files need not fit in memory
                hashobj = hash()
                fobj = open(full_path, 'rb')
                for chunk in chunk_reader(fobj):
                    hashobj.update(chunk)
                fobj.close()
                # Key on (digest, size): a hash collision alone won't flag two
                # different files as duplicates
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print "Duplicate found: %s and %s" % (full_path, duplicate)
                else:
                    hashes[file_id] = full_path

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print "Please pass the paths to check as parameters to the script"
nosklo