I have a large directory that contains only CS and math material. It is over 16 GB in size, and the file types are text, PNG, PDF and CHM. I currently have two branches of it: my brother's and mine. Both started from the same initial files. I need to compare them. I tried Git, but it takes a long time to load.

What is the best way to compare two big directories?

[Mixed Solution]

  1. Do a "ls -R > different_files" in both directories [1]
  2. "sdiff <(echo file1 | md5deep) <(echo file2 | md5deep)" [2]

What do you think? Any drawbacks?

[1] thanks to Paul Tomblin [2] great thanks to all repliers!

+2  A: 

How to compare 2 folders without pre-existing commands/products:

Simply create a program that scans each directory and creates a file hash of each file. It outputs a file with each relative file path and the file hash.

Run this program on both folders.

Then you simply compare the 2 output files to see if they are the same. To compare those 2 files you just load them into a string and do a string compare.

The hashing algorithm you use doesn't matter much for this purpose. You can use MD5, SHA, CRC, and so on. You could also include the file size in the output files to help reduce the chance of collisions.
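
As a rough illustration of the approach described above, here is a minimal Python sketch. The function names (file_md5, build_manifest, same_tree) and the tab-separated line layout are my own choices for the example, not part of any existing tool, and it assumes both trees contain only regular, readable files.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Hash one file in 1 MiB chunks so large files are never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return sorted lines of 'relative/path<TAB>size<TAB>md5' for every file under root."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store paths relative to root, with "/" separators, so that
            # manifests from two different locations are comparable.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            lines.append(f"{rel}\t{os.path.getsize(full)}\t{file_md5(full)}")
    return sorted(lines)


def same_tree(root_a, root_b):
    """True if both trees hold the same relative paths with the same sizes and hashes."""
    return build_manifest(root_a) == build_manifest(root_b)
```

Writing each manifest to a file (one line per entry) and running a diff tool on the two files would show which entries differ, not only whether the trees match.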

How to compare 2 folders with pre-existing commands/products:

If you just want an existing program that does this, use diff -r, or WinDiff on Windows-based systems.

Brian R. Bondy
+1  A: 

Are you just trying to discover what files are present in one that aren't in the other, and vice versa? A couple of suggestions:

  1. Do a "ls -R" in both directories, redirect to files, and diff the files.

  2. Do a "rsync -n" between them to see what rsync would have to copy if it were allowed to copy. (-n means don't do the rsync, just show what it would do if you ran it without the -n. You will probably want -r and -v as well, so that it recurses into subdirectories and lists the files.)
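
If you would rather not parse "ls -R" output, the first suggestion can be approximated with a few lines of Python. This is only a sketch: relative_files and presence_diff are made-up names, and like the "ls -R" comparison it only tells you which files exist on one side and not the other, not whether files with the same name have different contents.

```python
import os


def relative_files(root):
    """Collect the path of every file under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def presence_diff(root_a, root_b):
    """Return (only_in_a, only_in_b) as sorted lists of relative paths."""
    files_a = relative_files(root_a)
    files_b = relative_files(root_b)
    return sorted(files_a - files_b), sorted(files_b - files_a)
```

Calling presence_diff("mine", "brothers") (hypothetical directory names) returns two lists: files that exist only in the first tree, and files that exist only in the second.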

Paul Tomblin
Thank you! I mixed your solution and Brian's solution to get my solution:

  1. Do a "ls -R > different_files" in both directories
  2. sdiff <(echo file1 | md5deep) <(echo file2 | md5deep)

What do you think? Any drawbacks?
Masi
+1  A: 

I would diff them by comparing the output of md5sum * | sort for each directory.

That will take you to the files that are different or missing.

flybywire
A: 

Use md5deep to create recursive md5sum listings of every file in those directories.

You can then use a diff tool to compare the generated listings.
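
If a plain diff of the two listings is too noisy, a small script can sort the differences into categories. The sketch below assumes each listing line looks like "<hash>  <path>" and that the paths were recorded relative to each directory's root, so the same file has the same path in both listings. The names parse_listing and compare_listings are hypothetical.

```python
def parse_listing(text):
    """Map path -> hash from lines shaped like '<hash>  <path>'."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths may contain spaces.
        digest, path = line.split(None, 1)
        entries[path] = digest
    return entries


def compare_listings(text_a, text_b):
    """Report paths missing on either side and paths whose hashes differ."""
    a = parse_listing(text_a)
    b = parse_listing(text_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "changed": sorted(p for p in a.keys() & b.keys() if a[p] != b[p]),
    }
```

Read each listing file into a string and pass both strings to compare_listings. The "changed" entry lists the files worth opening in a real diff tool.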

froh42
A: 

I know this question has already been answered. However, if you are not into writing such a tool yourself, there is a well-working open source project named tardiff, available on SourceForge, which does basically what you want. It even supports automated creation of patches (in tar format, obviously) to account for the differences.

Hope this helps

none