ansaurus

Question

What is the fastest way to compare two list of items?

Answer 1

+1 A:

Generate md5 or sha1 checksums and compare those. Something like this

cd dir1; md5sum * | sort > /tmp/hash1
cd dir2; md5sum * | sort > /tmp/hash2
diff /tmp/hash1 /tmp/hash2  # could also use comm

If you're only worried about the names, not about the contents of the files, then diff dir1 dir2 works fine.

lhf 2010-03-14 02:28:31

He needs the names of the files...

Itay Moav 2010-03-14 02:35:55

If the files are the same then the hashes should be the same

Mimisbrunnr 2010-03-14 02:37:12

@Mimisbrunnr: Relevant quote :"and then tell me what files are missing from each to make them in sync" hash won't help here, unless you mean to do it in two steps, statistically assuming most cases both dirs will be in sync.

Itay Moav 2010-03-14 02:51:33

Answer 2

+4 A:

If you're in C, use qsort() to sort the file lists in ascending order, then use a kind of "merge:

Have two pointers starting at the beginning of each list. Do the following:

if the names are same, then this name exists in both lists - advance both pointers
if the name in list1 > name in list2, then list 2 is the only one that has it - advance list2's pointer
otherwise the name in list1 is in list1 only - advance list1's pointer
repeat

When you're at the end of one of the lists, all the elements left in the other are obviously missing from the first one.

Alternatively, you can combine both lists while keeping track what list each element comes from. Then sort the combined list. Scan the sorted list. If you see two instances of same value, then it was in both lists. Otherwise you'll know which list it comes from.

Arkadiy 2010-03-14 02:47:18

Answer 3

+8 A:

diff -s [path1] [path2]

Brian DesArmo 2010-03-14 04:15:24

I like no-code answers.

Peter Wone 2010-03-14 04:55:40

Answer 4

+3 A:

Also, another approach you can follow is

If there is no constraint on space, I would go for putting the files of one folder in a hash .. It will take O(N) time and some space..! then I will take each file of the second folder and check if the key exists in the first hash .. this is again O(1) time operation ..! problem solved in O(N) time.. but this is big on space requirement ..

repeat the same in reverse oder depends if u want speed or space..!

2010-03-14 04:40:13

Answer 5

+1 A:

If you need this information just to synchronize them, you can do the comparison and copying in a single pass:

Get a directory listing from both directories
sort both listings lexicographically
loop simultaneously through both lists:
- if one of the lists is empty, stop the loop
- if both elements are the same: step both indices
- else take the lexicographically lower element, copy it over, and step only this index
copy any remaining elements of the non-empty list, if existent

If you want to do it in two passes, or need the information what gets copied where, substitute "copy over" with "put name and direction into result list".

Svante 2010-03-14 12:37:36

ansaurus

tags:

views:

answers:

What is the fastest way to compare two list of items?

related questions