views: 1041

answers: 3

Hello there to all. Well, I am not a programming guru in Python, but I manage to write some scripts :) The one I am writing finds and removes duplicate files from a folder. I have multiple copies of mp3 files, and the same is the case with other files. So if anyone can help solve this problem, I will be thankful. I am using the SHA-1 algorithm.

Thanking you all in advance :)

+5  A: 

I wrote one in Python some time ago -- you're welcome to use it.

import sys
import os
import hashlib

check_path = (lambda filepath, hashes, p = sys.stdout.write:
        (lambda hash = hashlib.sha1 (file (filepath).read ()).hexdigest ():
                ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                          '   %s\n'
                                          'of %s\n' % (filepath, hashes[hash])))
                 or hashes.setdefault (hash, filepath)))())

scan = (lambda dirpath, hashes = {}: 
                map (lambda (root, dirs, files):
                        map (lambda filename: check_path (os.path.join (root, filename), hashes), files), os.walk (dirpath)))

((len (sys.argv) > 1) and scan (sys.argv[1]))
John Millikin
+1 Ok, that's pretty neat.
altCognito
I can't follow what's happening there. If you get a chance, could you maybe explain a little of what's going on?
tgray
-1: unreadable code
nosklo
-1: Lambdas (ToT)
S.Lott
+0: looks like lisp
cobbal
I wonder if the author is furious for being downvoted :)
Anonymous
+1: looks like lisp and answers the question.
Joe
Anon -- I posted this just to be silly. It's a rather difficult-to-read solution to an easy problem. The original question could probably be solved in a few lines of idiomatic Python, if the original poster bothered to spend a few minutes of quality time with Google.
John Millikin
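For reference, the "few lines of idiomatic Python" alluded to above might look something like this minimal sketch (hypothetical script; it reads each file fully into memory and only reports duplicates rather than deleting them):

import os
import sys
import hashlib

def find_duplicates(top):
    """Walk `top` and report files whose SHA-1 digest was already seen."""
    seen = {}  # digest -> first path that produced it
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha1(open(path, 'rb').read()).hexdigest()
            if digest in seen:
                print('%s duplicates %s' % (path, seen[digest]))
            else:
                seen[digest] = path

if __name__ == '__main__':
    find_duplicates(sys.argv[1])
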
+5  A: 
import os
import hashlib

def remove_duplicates(dir):
   unique = []
   for filename in os.listdir(dir):
      path = os.path.join(dir, filename)   # os.listdir returns bare names
      if os.path.isfile(path):
         filehash = hashlib.md5(open(path, 'rb').read()).hexdigest()
         if filehash not in unique: unique.append(filehash)
         else: os.remove(path)

//edit:

For mp3 files you may also be interested in this topic: http://stackoverflow.com/questions/476227/detect-duplicate-mp3-files-with-different-bitrates-and-or-different-id3-tags
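(Not from that thread, just a hypothetical illustration of the simplest idea in that direction: hash only the audio bytes, skipping a leading ID3v2 tag and a trailing ID3v1 tag. This handles differing tags only; differing bitrates would need actual audio fingerprinting.)

import hashlib

def audio_digest(path):
    """SHA-1 of an mp3's bytes with ID3v2 (leading) and ID3v1 (trailing) tags stripped."""
    data = open(path, 'rb').read()
    if data[:3] == b'ID3' and len(data) > 10:
        # The ID3v2 size field is a 4-byte "syncsafe" integer: 7 bits per byte.
        size = 0
        for byte in bytearray(data[6:10]):
            size = (size << 7) | (byte & 0x7f)
        data = data[10 + size:]
    if data[-128:][:3] == b'TAG':
        data = data[:-128]  # drop the trailing ID3v1 tag
    return hashlib.sha1(data).hexdigest()
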

zalew
of course you can replace the md5 hashing with sha1
zalew
For performance, you should probably change unique to be a set (though it probably won't be a big factor unless there are lots of small files). Also, your code will fail if there is a directory in the dir. Check os.path.isfile() before you process them.
Brian
Yep, this code is more of a starting point. I added isfile as you suggested.
zalew
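For completeness, here is a sketch of the same loop with Brian's remaining suggestion applied, using a set for the seen hashes (and with MD5 swapped for SHA-1, as zalew notes is possible); the behaviour is otherwise the same:

import os
import hashlib

def remove_duplicates(dir):
    seen = set()  # hashes of the files kept so far; set membership is O(1)
    for filename in os.listdir(dir):
        path = os.path.join(dir, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        filehash = hashlib.sha1(open(path, 'rb').read()).hexdigest()
        if filehash in seen:
            os.remove(path)  # an identical file was already kept
        else:
            seen.add(filehash)
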
+5  A: 

Recursive folders version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan them all recursively and report all duplicates found.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                # Hash the file in chunks so large files need not fit in memory
                hashobj = hash()
                fobj = open(full_path, 'rb')
                for chunk in chunk_reader(fobj):
                    hashobj.update(chunk)
                fobj.close()
                # Key on (digest, size): a hash collision alone won't flag two
                # different files as duplicates
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print "Duplicate found: %s and %s" % (full_path, duplicate)
                else:
                    hashes[file_id] = full_path

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print "Please pass the paths to check as parameters to the script"
nosklo