views: 97
answers: 6

Hi all,

I'm dealing with a large number of files (30,000) of about 10 MB each. Some of them (I estimate 2%) are actually duplicates, and I need to keep only one copy of each duplicated pair (or triplet). Could you suggest an efficient way to do that? I'm working on Unix.

Thank you :-)

+2  A: 

I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes.
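A minimal sketch of that idea in Python might look like the following (assuming the files sit under a single directory tree; 'DIR' is a placeholder, and the script only prints the duplicates so you can review them before deleting anything):

import hashlib
import os

def file_hash(path, chunk_size=1 << 20):
    # Read the file in 1 MB chunks so a 10 MB file never has to fit in memory at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

seen = {}  # digest -> first path seen with that digest
for root, dirs, names in os.walk('DIR'):
    for name in names:
        path = os.path.join(root, name)
        digest = file_hash(path)
        if digest in seen:
            print('duplicate:', path, 'of', seen[digest])
            # os.remove(path)  # uncomment once you trust the output
        else:
            seen[digest] = path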

Joe
What's a metabyte? Some sort of idealised byte? And your solution only works if you have a perfect hash function.
anon
What *isn't* a metabyte? Fixed. The paranoid could compare the contents of the files before deleting. Adding an extra hash could also help.
Joe
@Neil If you use a modern, currently unbroken cryptographic hash function and you find a collision, your algorithm breaks down, but you have also gained a cryptography paper, so it's all win. It is worth comparing the supposed duplicates before erasing one of them, though.
Pascal Cuoq
Proper cryptographic hash functions are not perfect, by a simple counting argument, but you can treat them as if they were for all intents and purposes.
Pascal Cuoq
@Pascal There certainly can be a collision. Consider that a file can be seen as a very large single binary number, much larger than the hash. Collisions are thus inevitable, because the hash loses information.
anon
+5  A: 

Can I suggest Fdupes?

Il-Bhima
Great! I didn't know this :-) Definitely the fastest way.
Thrawn
+1  A: 

Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be etc., it can't really be done much more efficiently.
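
A rough sketch of that staged approach, assuming a Python script is acceptable ('DIR' is again a placeholder, and it only reports duplicates rather than deleting them):

import filecmp
import hashlib
import os
from collections import defaultdict

# Stage 1: group files by size; only files with a matching size can be duplicates.
by_size = defaultdict(list)
for root, dirs, names in os.walk('DIR'):
    for name in names:
        path = os.path.join(root, name)
        by_size[os.path.getsize(path)].append(path)

def md5sum(path):
    # Stage 2 helper: MD5 of the file contents, read in 1 MB chunks.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

for size, paths in by_size.items():
    if len(paths) < 2:
        continue  # unique size: cannot have a duplicate
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[md5sum(path)].append(path)
    for group in by_hash.values():
        keep, *candidates = group
        for path in candidates:
            # Stage 3: byte-for-byte comparison before trusting the hash match.
            if filecmp.cmp(keep, path, shallow=False):
                print('duplicate:', path, 'of', keep)

Persisting the digest map between runs would cover the point about caching the checksums; that part is left out of the sketch.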

Kilian Foth
It's a high-throughput download from different sources, so I got some redundancy. I'll try md5sum, so I should get a checksum for each of them. I'll let you know if it works :-)
Thrawn
+1  A: 

Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

Now you can use cmp to check that the files are really identical.

Aaron Digulla
-w is a feature of GNU uniq; -d will only find consecutive duplicates, so you'd have to sort first
hop
You're right. Fixed.
Aaron Digulla
+2  A: 

You can try this snippet to get a list of all the duplicates before removing anything:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]}(!($1 in seen)){seen[$1]=$2}'
ghostdog74
A: 

Save all the file names in an array. Then traverse the array, and in each iteration compare the file's contents with those of the other files by using the command 'md5sum'. If they are the same, remove the file.

For example, if file 'b' is a duplicate of file 'a', then md5sum will report the same checksum for both files.

karthi_ms
you might want to consider the algorithmic complexity of that particular approach...
hop