Hi, I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, so that I can then query and search for files that are identical. What would be the recommended hash function or command-line tool for creating hashes that are extremely unlikely to be identical for different files? Thanks B

A: 

You can use an MD5 hash or SHA-1. For example, in PHP:

    function process_dir($path) {
        if ($handle = opendir($path)) {
            while (false !== ($file = readdir($handle))) {
                if ($file != "." && $file != "..") {
                    if (is_dir($path . "/" . $file)) {
                        // recurse into subdirectories
                        process_dir($path . "/" . $file);
                    } else {
                        // you can change md5 to sha1
                        // this is where you would put the hash (and path) into your database
                        $hash = md5(file_get_contents($path . "/" . $file));
                    }
                }
            }
            closedir($handle);
        }
    }

If you are working on Windows, change the slashes to backslashes.
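
As written, the loop discards `$hash`. One way to keep the results (a small sketch, not part of the answer above; the function and variable names here are only illustrative) is to collect path => hash pairs into an array that can later be written to the database:

    // Variant of process_dir() that collects results instead of discarding $hash.
    function collect_hashes($path, &$hashes) {
        if ($handle = opendir($path)) {
            while (false !== ($file = readdir($handle))) {
                if ($file == "." || $file == "..") continue;
                $full = $path . "/" . $file;
                if (is_dir($full)) {
                    collect_hashes($full, $hashes);
                } else {
                    $hashes[$full] = md5(file_get_contents($full));
                }
            }
            closedir($handle);
        }
    }

    $file_hashes = array();
    collect_hashes("/path/to/scan", $file_hashes);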

jcubic
A: 

I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.

Generally speaking, I recommend SHA1 because it has no known collisions (whereas MD5 collisions can be found in minutes), and SHA1 doesn't tend to be a bottleneck when working with hard disks. If you're obsessed with getting your program to run fast in the presence of a solid-state drive, either go with MD5, or waste days and days of your time figuring out how to parallelize the operation. In any case, do not parallelize hashing until your program does everything you need it to do.
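
As a small illustration (not part of the original answer), PHP has built-in helpers that hash a file by streaming it from disk rather than reading the whole file into a string first, for either algorithm:

    // md5_file() and sha1_file() are built into PHP; choose per the trade-off above.
    $md5  = md5_file($path);   // 32 hex characters; faster, but collisions are known
    $sha1 = sha1_file($path);  // 40 hex characters; slower, preferred in this answer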

Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.

If you use sqlite3 and perform the insertions in a BEGIN/COMMIT block, you're probably looking at about 10000 insertions per second in the presence of indexes. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750000 files (85 GB). The whole insert and SHA1 hash operation took less than an hour, and it created a 140MB sqlite3 file. However, my query to find duplicate files and sort them by ID takes less than 20 seconds to run.
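
To make that concrete, here is a minimal PHP/PDO sketch of the same idea (the tool described in this answer is written in Haskell; the table, column, and file names below are only illustrative): batch the insertions inside one transaction, then find duplicates by grouping on the hash.

    $db = new PDO('sqlite:hashes.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS files (
                   id   INTEGER PRIMARY KEY,
                   path TEXT NOT NULL,
                   hash TEXT NOT NULL)');
    $db->exec('CREATE INDEX IF NOT EXISTS idx_files_hash ON files(hash)');

    // One big transaction is what makes thousands of insertions per second possible.
    $db->beginTransaction();
    $insert = $db->prepare('INSERT INTO files (path, hash) VALUES (?, ?)');
    foreach ($file_hashes as $path => $hash) {  // path => hash pairs from your scan
        $insert->execute(array($path, $hash));
    }
    $db->commit();

    // Groups of identical files: any hash that appears more than once.
    $dupes = $db->query(
        "SELECT hash, GROUP_CONCAT(path, '; ') AS paths
           FROM files
          GROUP BY hash
         HAVING COUNT(*) > 1"
    )->fetchAll(PDO::FETCH_ASSOC);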

In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.

Joey Adams
@joey how far along are you with your tool? I've been looking for a simple tool that does this for ages but couldn't find anything online beyond the obvious "compare two directories" shareware tools.
b20000
My program is already capable of loading file tree information into a database and hashing files; it works fabulously. I'm currently working on the problem of replacing duplicate files with hardlinks. Note that my program will probably only work on Linux and other Unix-like systems because it's tied to the stat structure filled in by the [`lstat()`](http://linux.die.net/man/2/lstat) function.
Joey Adams
Also, it has absolutely no frontend yet; you would have to paste in the path you want to scan, and for more complicated operations, learn how to work with Haskell code.
Joey Adams