Hi, I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, so that I can then query and search for files that are identical. What would be the recommended hash function or command-line tool for creating hashes that are extremely unlikely to be identical for different files? Thanks, B
You can use an MD5 or SHA-1 hash. For example, in PHP:
function process_dir($path) {
    if ($handle = opendir($path)) {
        while (false !== ($file = readdir($handle))) {
            if ($file != "." && $file != "..") {
                if (is_dir($path . "/" . $file)) {
                    // Recurse into subdirectories.
                    process_dir($path . "/" . $file);
                } else {
                    // You can change md5 to sha1.
                    // This is the hash you would insert into the
                    // database, together with the file path.
                    $hash = md5(file_get_contents($path . "/" . $file));
                }
            }
        }
        closedir($handle);
    }
}
If you're working on Windows, you can change the slashes to backslashes, though PHP also accepts forward slashes in Windows paths.
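If you want to wire in the database part, here is one way it could look with PDO and SQLite. This is just a sketch: the database file name, the files table, and the root directory are placeholders, so adjust them to your own schema.

$db = new PDO('sqlite:hashes.db');   // placeholder database file
$db->exec('CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)');
$stmt = $db->prepare('INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)');

function process_dir($path, $stmt) {
    if ($handle = opendir($path)) {
        while (false !== ($file = readdir($handle))) {
            if ($file != "." && $file != "..") {
                $full = $path . "/" . $file;
                if (is_dir($full)) {
                    process_dir($full, $stmt);
                } else {
                    // Store the path and its hash in one row.
                    $stmt->execute(array($full, md5(file_get_contents($full))));
                }
            }
        }
        closedir($handle);
    }
}

process_dir('/path/to/root', $stmt);   // placeholder root directory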
I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.
Generally speaking, I recommend SHA1: producing a SHA1 collision is still vastly harder than producing an MD5 collision (which can be done in minutes), and SHA1 doesn't tend to be a bottleneck when working with hard disks. If you're obsessed with getting your program to run fast on a solid-state drive, either go with MD5, or waste days and days of your time figuring out how to parallelize the operation. In any case, do not parallelize hashing until your program does everything you need it to do.
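(If you're doing this from PHP, note that md5_file() and sha1_file() hash a file directly from disk, so you don't have to read the whole thing into a string first. A quick sketch:)

// Hash a file directly; sha1_file()/md5_file() return the hex digest
// as a string, or false on failure.
$hash = sha1_file('/path/to/some/file');   // placeholder path; swap in md5_file() for MD5
if ($hash === false) {
    echo "Could not hash file\n";
}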
Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.
If you use sqlite3 and perform the insertions inside a BEGIN/COMMIT block, you're probably looking at about 10,000 insertions per second even with indexes in place. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750,000 files (85 GB). The whole insert-and-SHA1-hash operation took less than an hour and produced a 140 MB sqlite3 file, yet my query to find duplicate files and sort them by ID takes less than 20 seconds to run.
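For reference, the batched insert and the duplicate lookup might look something like this with PDO (the files table and column names are my own placeholders, not anything canonical):

$db = new PDO('sqlite:hashes.db');   // placeholder database file
$db->exec('CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, path TEXT, hash TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash)');

$entries = array();   // fill with array($path, $hash) pairs from your scan

// One big transaction: without it, sqlite3 syncs to disk after every
// INSERT and throughput collapses.
$db->beginTransaction();
$stmt = $db->prepare('INSERT INTO files (path, hash) VALUES (?, ?)');
foreach ($entries as $entry) {
    $stmt->execute($entry);
}
$db->commit();

// Find duplicates: every hash that occurs more than once.
$dupes = $db->query(
    'SELECT hash, COUNT(*) AS copies FROM files
     GROUP BY hash HAVING COUNT(*) > 1
     ORDER BY copies DESC'
);
foreach ($dupes as $row) {
    echo $row['hash'] . ': ' . $row['copies'] . " copies\n";
}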
In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.