Hi there, I'm doing a large data migration between two file systems (let's call them F1 and F2) on a Linux system. The migration necessarily involves copying the data verbatim into a differently structured hierarchy on F2 and renaming the files.

I'd like to write a script to generate a list of the files which are in F1 but not in F2, i.e. the ones which weren't copied by the migration script into the new hierarchy, so that I can go back and migrate them manually. Unfortunately, for reasons not worth going into, the migration script can't be modified to list the files it doesn't migrate. My question differs from this previously answered one because I cannot rely on file names for the comparison.

I know the basic outline of the process would be:

  1. Generate a list of checksums for all files, recursing through F1
  2. Do the same for F2
  3. Compare the two lists, ignoring the file names, and work out which checksums are present in F1's list but missing from F2's, to find the files which are in F1 but not in F2.

I'm kind of stuck getting past that stage, so I'd appreciate any pointers on which tools to use. I think I need to use the 'comm' command to compare the two lists of checksums, but since md5sum, sha512sum and the like put the file name next to the checksum, I can't see a way to get a useful comparison out of them. Maybe awk is the way to go?
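
For concreteness, this is roughly what I've tried so far (the mount points /mnt/f1 and /mnt/f2 are made up for illustration), and it doesn't do what I want because each line still carries a file name that's unique to its own hierarchy:

find /mnt/f1 -type f -exec md5sum {} \; | sort > f1-sums.txt
find /mnt/f2 -type f -exec md5sum {} \; | sort > f2-sums.txt
comm -23 f1-sums.txt f2-sums.txt

The comm output ends up listing practically every line of f1-sums.txt, because the file names never match even when the checksums do.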

I'm using Red Hat Enterprise Linux 5.x.

Thanks.

+1  A: 

Perhaps take a look at the source code of FSLint for pointers: http://code.google.com/p/fslint/source/browse/trunk/fslint/findup

Ash Kim
+2  A: 

On F1:

# find / -type f -exec md5sum {} + > F1

On F2:

# find / -type f -exec md5sum {} + > F2

then:

# diff F1 F2

You might want to check more options for find; this line only finds regular files.
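
For example, if F1 were mounted at a hypothetical /mnt/f1, you could point find at that mount point and add -xdev so it doesn't descend into any other filesystems mounted underneath it:

# find /mnt/f1 -xdev -type f -exec md5sum {} + > F1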

jgr
that's awesome!
Ash Kim
Thanks user362458, that's useful. However, because 'md5sum' puts the name of the file next to the checksum, no line in the file 'F1' will match any line in 'F2', even if the checksums are identical.
grw
Ah, I read your post a bit too fast. If a new hierarchy is created you'd have to go for a solution like Unknown's at the bottom, preferably without the UUOC ;)
jgr
+1  A: 

You can do something like this:

f1# find yourrootdir -type f -exec sha1sum {} >> initial_files \; 
f1# ...copy initial_files to machine f2...
f1# ...start copy...
f2# find yournewrootdir -type f -exec sha1sum {} >> final_files \;
f2# sort initial_files > INITIAL
f2# sort final_files > FINAL
f2# for sha1 in `comm -23 <(cat INITIAL | awk '{print $1}') <(cat FINAL | awk '{print $1}')`; do grep $sha1 INITIAL; done

This will show the lines in "initial_files" whose SHA1 doesn't appear anywhere in final_files.

The last line feeds only the sha1sums into a comm command, then greps initial_files for each sha1sum that's missing from final_files.
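
If you'd rather avoid the extra cat calls (the UUOC mentioned in the comments above), one rough equivalent of that last line, using the same INITIAL and FINAL files, would be something like:

f2# comm -23 <(awk '{print $1}' INITIAL) <(awk '{print $1}' FINAL) | while read sha1; do grep "$sha1" INITIAL; done

It still runs grep once per missing checksum, but the cats and the backticks go away.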

Unknown
That's absolutely great - exactly what I was looking for. Does the job wonderfully :)
grw
Hi Unknown, I've made your solution into a script, and licensed it under the GPL. I hope this is okay with you; it seems like the best way to ensure that anyone can use it. If that's a problem, let me know and I'll take it down. http://github.com/capncodewash/Misc-shell-scripts/blob/master/find_missing_files.sh
grw
Don't worry about it, it's not like it's some top secret algorithm... :)
Unknown