views: 122
answers: 3
The following command prints a long list of hashes and file names:

md5sum *.java

I have tried unsuccessfully to list the lines where identical hashes occur, so that I could then remove the corresponding duplicate files.

How can I find and delete files that have identical content?

+1  A: 

Even better:

md5sum *.java | sort | uniq -d

That only prints the duplicate lines.

Zsolt Botykai
This gives me no output. The reason may be that although two files have the same hash, they still have different names, so the whole lines differ. We should first somehow filter the names out.
Masi
+4  A: 

This should work:

md5sum *.java | sort | uniq -d -w32

This tells uniq to compare only the first 32 characters, i.e. the md5 sum and not the filenames.

EDIT: If -w isn't available, try:

md5sum *.java | awk '{print $1}' | sort | uniq -d

The downside is that you won't know which files have these duplicate checksums... anyway, if there aren't too many checksums, you can use

md5sum *.java | grep 0bee89b07a248e27c83fc3d5951213c1

to get the filenames afterwards (the checksum above is just an example). I'm sure there's a way to do all this in a shell script, too.

schnaader
Thank you! I noticed that the Mac version of uniq does not have the -w option. I think the reason is that they do not want many commands to have the same features. How can you filter the names out without the -w option?
Masi
Just forgot to add -w32 ;-)
Zsolt Botykai
By the way, before anyone tries to bruteforce the md5sum above, it's for a file that contains "abc" ;)
schnaader
Thank you! I love Awk :)
Masi
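The portable variant asked about above (no `uniq -w`) can be sketched as a short script. This is only a sketch: the sample `.java` files are created just for the demo, and the `md5 -r` fallback for macOS (which prints "hash filename" like md5sum does) is an assumption about the target machine.

```shell
#!/bin/sh
# Sketch: group identical files by checksum without uniq -w.
# Assumption: either GNU md5sum or BSD/macOS "md5 -r" is available.
set -e
dir=$(mktemp -d)          # throwaway demo directory
cd "$dir"
printf 'abc' > a.java     # a.java and k.java are identical
printf 'abc' > k.java
printf 'xyz' > b.java     # b.java is unique

if command -v md5sum >/dev/null 2>&1; then
  sum='md5sum'
else
  sum='md5 -r'            # BSD md5 -r prints "hash filename"
fi

# awk collects the filenames for each hash; at the end it prints
# only the hashes that were seen more than once, with their files.
$sum *.java | sort | awk '
  { files[$1] = files[$1] " " $2; seen[$1]++ }
  END { for (h in seen) if (seen[h] > 1) print h ":" files[h] }
'
```

This keeps the hash-to-filename association that the plain `awk '{print $1}'` pipeline loses, at the cost of holding all hashes in memory.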
+1  A: 

This lists all the files, putting a blank line between each group of identical hashes:
$ md5sum *.txt | sort | perl -pe '($y)=split; print "\n" unless $y eq $x; $x=$y'

05aa3dad11b2d97568bc506a7080d4a3 b.txt

2a517c8a78f1e1582b4ce25e6a8e4953 n.txt

e1254aebddc54f1cbc9ed2eacce91f28 a.txt
e1254aebddc54f1cbc9ed2eacce91f28 k.txt
e1254aebddc54f1cbc9ed2eacce91f28 p.txt
$

To print only the first of each group:
$ md5sum *.txt | sort | perl -ne '($y,$f)=split; print "$f\n" unless $y eq $x; $x=$y'
b.txt
n.txt
a.txt
$

if you're brave, change the "unless" to "if" (so only the duplicates are printed) and then

$ rm `md5sum ...`

to delete all but the first of each group

hornlo