ansaurus

Question

What the best approach to iterate and "store" files over a directory in C (Linux) ?

Answer 1

+3 A:

You probably want to use a hash or checksum; create a hash of the content of each file, and associate each hash with a list of filenames with that hash. Then, when you construct a hash, you will see what other files also share the same hash. Only if you get a collision will you actually have to compare the full files. If you choose a good hash function, then collisions should be relatively rare, making full comparisons rare.

Note that there is a tradeoff between the size of the hash and the number of collisions; if you use a smaller hash, the collisions will probably be more frequent, but you will use less space per file, and if you use a larger hash, then you will have to do fewer full file comparisons, but will need to hold onto and compare larger hashes. Also, some hash functions result in fewer collisions than other hashes, but it is possible that some better hash functions may be more time consuming and compute intensive than poorer ones.

An efficient method of file and directory traversal is to use ftw or nftw.

Michael Aaron Safyan 2010-04-23 08:20:04

In my opinion the hashing is not necessary ? Still the problem is how do I store the filenames. In what type of datastructure ?

Andrei Ciobanu 2010-04-23 08:23:51

@Andrei, you would use a hash table, mapping the hash of the file to a linked list of strings, where each string contains the path of a file.

Michael Aaron Safyan 2010-04-23 08:29:18

Without any sort of hash, how would you check for duplicates ? Compare the entire file content to the entire file content of all the other files in the directory, for each file ? Ofcourse, you could check first if the file sizes are equal as a first case - (though I've a directory here with 300k+ files of the same size also)That's going to be very slow compared to hashing all the files once, and comparing the content once if there is a collision.

nos 2010-04-23 08:41:01

+1, it would be very easy to implement a simple hash in the f/nftw() callback. Just don't believe older documentation that describes ftw/nftw as using a breadth-first algorithm. I've encountered man pages that do.

Tim Post 2010-04-23 08:42:16

ansaurus

tags:

views:

answers:

What the best approach to iterate and "store" files over a directory in C (Linux) ?

related questions