views:

190

answers:

5

I have two folders with approximately 10,000 files each. I'd like to write a script or program that can tell me if these folders are in sync and then tell me which files are missing from each to make them in sync.

Therefore, after generating a list of files, what is the fastest algorithm to sort them for unique files? What I'm thinking right now is comparing the first file on each list then if they are different remove one until they are the same, then remove both from the list (because they are not unique.)

Is there a faster algorithm than this?

+1  A: 

Generate md5 or sha1 checksums and compare those. Something like this

cd dir1; md5sum * | sort > /tmp/hash1
cd dir2; md5sum * | sort > /tmp/hash2
diff /tmp/hash1 /tmp/hash2  # could also use comm

If you're only worried about the names, not about the contents of the files, then diff dir1 dir2 works fine.

lhf
He needs the names of the files...
Itay Moav
If the files are the same then the hashes should be the same
Mimisbrunnr
@Mimisbrunnr: Relevant quote :"and then tell me what files are missing from each to make them in sync" hash won't help here, unless you mean to do it in two steps, statistically assuming most cases both dirs will be in sync.
Itay Moav
+4  A: 

If you're in C, use qsort() to sort the file lists in ascending order, then use a kind of "merge:

Have two pointers starting at the beginning of each list. Do the following:

  • if the names are same, then this name exists in both lists - advance both pointers
  • if the name in list1 > name in list2, then list 2 is the only one that has it - advance list2's pointer
  • otherwise the name in list1 is in list1 only - advance list1's pointer
  • repeat

When you're at the end of one of the lists, all the elements left in the other are obviously missing from the first one.

Alternatively, you can combine both lists while keeping track what list each element comes from. Then sort the combined list. Scan the sorted list. If you see two instances of same value, then it was in both lists. Otherwise you'll know which list it comes from.

Arkadiy
+8  A: 

diff -s [path1] [path2]

Brian DesArmo
I like no-code answers.
Peter Wone
+3  A: 

Also, another approach you can follow is

If there is no constraint on space, I would go for putting the files of one folder in a hash .. It will take O(N) time and some space..! then I will take each file of the second folder and check if the key exists in the first hash .. this is again O(1) time operation ..! problem solved in O(N) time.. but this is big on space requirement ..

repeat the same in reverse oder depends if u want speed or space..!

+1  A: 

If you need this information just to synchronize them, you can do the comparison and copying in a single pass:

  • Get a directory listing from both directories
  • sort both listings lexicographically
  • loop simultaneously through both lists:
    • if one of the lists is empty, stop the loop
    • if both elements are the same: step both indices
    • else take the lexicographically lower element, copy it over, and step only this index
  • copy any remaining elements of the non-empty list, if existent

If you want to do it in two passes, or need the information what gets copied where, substitute "copy over" with "put name and direction into result list".

Svante