views: 97
answers: 6

Hi all,

I'm dealing with a large number of files (30,000) of about 10 MB each. Some of them (I estimate 2%) are actually duplicates, and I need to keep only one copy of each duplicated pair (or triplet). Could you suggest an efficient way to do that? I'm working on Unix.

Thank you :-)

+2  A: 

I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes.
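A minimal sketch of that idea in Python might look like the following (assuming the files sit under a single directory tree; 'DIR' is a placeholder, and the script only prints the duplicates so you can review them before deleting anything):

import hashlib
import os

def file_hash(path, chunk_size=1 << 20):
    # Read the file in 1 MB chunks so a 10 MB file never has to fit in memory at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

seen = {}  # digest -> first path seen with that digest
for root, dirs, names in os.walk('DIR'):
    for name in names:
        path = os.path.join(root, name)
        digest = file_hash(path)
        if digest in seen:
            print('duplicate:', path, 'of', seen[digest])
            # os.remove(path)  # uncomment once you trust the output
        else:
            seen[digest] = path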

Joe
What's a metabyte? Some sort of idealised byte? And your solution only works if you have a perfect hash function.
anon
What *isn't* a metabyte? Fixed. The paranoid could compare the contents of the files before deleting. Adding an extra hash could also help.
Joe
@Neil If you use a modern, currently unbroken cryptographic hash function and you find a collision, your algorithm breaks down, but you have also gained a cryptography paper, so it's all win. It is worth comparing the supposed duplicates before erasing one of them, though.
Pascal Cuoq
Proper cryptographic hash functions are not perfect, by a simple counting argument, but you can treat them as if they were for all intents and purposes.
Pascal Cuoq
@Pascal There certainly can be a collision. Consider that a file can be seen as a very large single binary number, much larger than the hash. Collisions are thus inevitable, because the hash loses information.
anon
+5  A: 

Can I suggest Fdupes?

Il-Bhima
Great! I didn't know this :-) Definitely the fastest way.
Thrawn
+1  A: 

Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be etc., it can't really be done much more efficiently.
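
A rough sketch of that staged approach, assuming a Python script is acceptable ('DIR' is again a placeholder, and it only reports duplicates rather than deleting them):

import filecmp
import hashlib
import os
from collections import defaultdict

# Stage 1: group files by size; only files with a matching size can be duplicates.
by_size = defaultdict(list)
for root, dirs, names in os.walk('DIR'):
    for name in names:
        path = os.path.join(root, name)
        by_size[os.path.getsize(path)].append(path)

def md5sum(path):
    # Stage 2 helper: MD5 of the file contents, read in 1 MB chunks.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

for size, paths in by_size.items():
    if len(paths) < 2:
        continue  # unique size: cannot have a duplicate
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[md5sum(path)].append(path)
    for group in by_hash.values():
        keep, *candidates = group
        for path in candidates:
            # Stage 3: byte-for-byte comparison before trusting the hash match.
            if filecmp.cmp(keep, path, shallow=False):
                print('duplicate:', path, 'of', keep)

Persisting the digest map between runs would cover the point about caching the checksums; that part is left out of the sketch.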

Kilian Foth
It's a high-throughput download from different sources, so I got some redundancy. I'll try md5sum, so I should get a checksum for each of them. I'll let you know if it works :-)
Thrawn
+1  A: 

Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

Now you can use cmp to check that the files are really identical.

Aaron Digulla
-w is a feature of GNU uniq; -d will only find consecutive duplicates, so you'd have to sort first
hop
You're right. Fixed.
Aaron Digulla
+2  A: 

You can try this snippet to get a list of all the duplicates before removing anything:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]}(!($1 in seen)){seen[$1]=$2}'
ghostdog74
A: 

Save all the file names in an array. Then traverse the array, and in each iteration compare the file's contents with those of the other files by using the command 'md5sum'. If they are the same, remove the file.

For example, if file 'b' is a duplicate of file 'a', then md5sum will report the same checksum for both files.

karthi_ms
you might want to consider the algorithmic complexity of that particular approach...
hop