Hi, I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, so that I can then query and search for files that are identical. What would be the recommended hash function or command-line tool for creating hashes that are extremely unlikely to be identical for different files? Thanks, B
You can use an MD5 or SHA-1 hash. For example, in PHP:
function process_dir($path) {
    if ($handle = opendir($path)) {
        while (false !== ($file = readdir($handle))) {
            if ($file != "." && $file != "..") {
                if (is_dir($path . "/" . $file)) {
                    // Recurse into subdirectories.
                    process_dir($path . "/" . $file);
                } else {
                    // You can change md5 to sha1.
                    // This is the hash you would insert into the
                    // database, together with the file path.
                    $hash = md5(file_get_contents($path . "/" . $file));
                }
            }
        }
        closedir($handle);
    }
}
If you're working on Windows, you can change the slashes to backslashes, though PHP also accepts forward slashes in Windows paths.
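If you want to wire in the database part, here is one way it could look with PDO and SQLite. This is just a sketch: the database file name, the files table, and the root directory are placeholders, so adjust them to your own schema.

$db = new PDO('sqlite:hashes.db');   // placeholder database file
$db->exec('CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)');
$stmt = $db->prepare('INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)');

function process_dir($path, $stmt) {
    if ($handle = opendir($path)) {
        while (false !== ($file = readdir($handle))) {
            if ($file != "." && $file != "..") {
                $full = $path . "/" . $file;
                if (is_dir($full)) {
                    process_dir($full, $stmt);
                } else {
                    // Store the path and its hash in one row.
                    $stmt->execute(array($full, md5(file_get_contents($full))));
                }
            }
        }
        closedir($handle);
    }
}

process_dir('/path/to/root', $stmt);   // placeholder root directory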
I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.
Generally speaking, I recommend SHA1: producing a SHA1 collision is still vastly harder than producing an MD5 collision (which can be done in minutes), and SHA1 doesn't tend to be a bottleneck when working with hard disks. If you're obsessed with getting your program to run fast on a solid-state drive, either go with MD5, or waste days and days of your time figuring out how to parallelize the operation. In any case, do not parallelize hashing until your program does everything you need it to do.
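(If you're doing this from PHP, note that md5_file() and sha1_file() hash a file directly from disk, so you don't have to read the whole thing into a string first. A quick sketch:)

// Hash a file directly; sha1_file()/md5_file() return the hex digest
// as a string, or false on failure.
$hash = sha1_file('/path/to/some/file');   // placeholder path; swap in md5_file() for MD5
if ($hash === false) {
    echo "Could not hash file\n";
}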
Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.
If you use sqlite3 and perform the insertions inside a BEGIN/COMMIT block, you're probably looking at about 10,000 insertions per second even with indexes in place. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750,000 files (85 GB). The whole insert-and-SHA1-hash operation took less than an hour and produced a 140 MB sqlite3 file, yet my query to find duplicate files and sort them by ID takes less than 20 seconds to run.
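For reference, the batched insert and the duplicate lookup might look something like this with PDO (the files table and column names are my own placeholders, not anything canonical):

$db = new PDO('sqlite:hashes.db');   // placeholder database file
$db->exec('CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, path TEXT, hash TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash)');

$entries = array();   // fill with array($path, $hash) pairs from your scan

// One big transaction: without it, sqlite3 syncs to disk after every
// INSERT and throughput collapses.
$db->beginTransaction();
$stmt = $db->prepare('INSERT INTO files (path, hash) VALUES (?, ?)');
foreach ($entries as $entry) {
    $stmt->execute($entry);
}
$db->commit();

// Find duplicates: every hash that occurs more than once.
$dupes = $db->query(
    'SELECT hash, COUNT(*) AS copies FROM files
     GROUP BY hash HAVING COUNT(*) > 1
     ORDER BY copies DESC'
);
foreach ($dupes as $row) {
    echo $row['hash'] . ': ' . $row['copies'] . " copies\n";
}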
In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.