ansaurus

Question

Suggestions for duplicate file finder algorithm (using C)

Answer 1

+1 A:

Because you're using pthreads, I assume you're working in a Unix environment -- in which case you could mmap(2) both files into memory and compare the memory arrays directly.

Steve Emmerson 2010-04-18 16:30:37

What if the files are bigger than the available contiguous chunks of address space?

atzz 2010-04-18 16:36:34
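
One possible way around the address-space concern raised in the comment above is to map and compare the files one window at a time instead of all at once. The following is only a sketch under assumptions: the function name `mmap_files_equal` and the `window_pages` parameter are hypothetical, and the window is expressed in pages because `mmap` offsets must be page-aligned.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: compares two files through a sliding mmap window.
 * window_pages is an assumed tuning knob (window size in pages).
 * Returns 1 if the files are identical, 0 if they differ, -1 on error. */
int mmap_files_equal(const char *path1, const char *path2, long window_pages)
{
    struct stat st1, st2;
    off_t window = (off_t)sysconf(_SC_PAGESIZE) * window_pages;
    int fd1 = open(path1, O_RDONLY);
    int fd2 = open(path2, O_RDONLY);
    int result = -1;

    if (window > 0 && fd1 >= 0 && fd2 >= 0 &&
        fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0) {
        /* Files of different sizes cannot be identical. */
        result = (st1.st_size == st2.st_size);

        for (off_t off = 0; result == 1 && off < st1.st_size; off += window) {
            off_t left = st1.st_size - off;
            size_t len = (size_t)(left < window ? left : window);
            void *p1 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd1, off);
            void *p2 = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd2, off);

            if (p1 == MAP_FAILED || p2 == MAP_FAILED)
                result = -1;
            else if (memcmp(p1, p2, len) != 0)
                result = 0;

            /* Unmap before moving on, so only one window per file is mapped. */
            if (p1 != MAP_FAILED) munmap(p1, len);
            if (p2 != MAP_FAILED) munmap(p2, len);
        }
    }

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return result;
}
```

Something like `mmap_files_equal("a.bin", "b.bin", 4096)` would then keep at most 4096 pages of each file mapped at any moment, regardless of the file size.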

Answer 2

+5 A:

The limiting factor will be disk reads, which (assuming that both files are on the same disk) will be serialized anyway, so I don't think threading will help much at all.

Thomas Padron-McCarthy 2010-04-18 16:36:52

Answer 3

+3 A:

It's hard to guess about performance without a real system to test against (for example if you're using a solid state drive, there's no head seek time and the cost of reading different sectors from different threads is almost zero).

If this is running against a reasonably standard computer with regular (spinning platter) hard drives, having multiple threads contend for the part of the disk they want to read from will possibly slow things down (depending, again, on the hardware and also the size of the chunks).

If the time it takes to compute the "sameness" of a chunk is short compared to the time it takes to read that chunk from disk, having a separate thread will not help much, since the second (or third...) thread would spend most of its time waiting for IO to complete anyway.

Another factor is the cache size of the CPU. If all of the memory you're processing at one time fits in the CPU cache, things will be much faster than if different threads cause different chunks of memory to be loaded into cache as they execute instructions.

If you have more threads than you have CPU cores, you will just slow things down by making unnecessary context switches (since a thread needs a core to run on).

After reading all of that, if you still think multithreading is going to help for your target system, consider one thread that does IO only, places the data in a queue, and has two or more worker threads taking data off of the queue to process. That way, you optimize disk IO and can take advantage of multiple cores to crunch the numbers.
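
The one-reader, several-workers arrangement described above could be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: the names (`queued_files_equal`, `struct queue`, `worker`) and the constants `CHUNK`, `SLOTS` and `WORKERS` are all hypothetical, and the calling thread plays the role of the IO thread.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   65536  /* assumed chunk size in bytes */
#define SLOTS   8      /* assumed queue depth */
#define WORKERS 2      /* assumed number of worker threads */

struct chunk { size_t len; unsigned char a[CHUNK], b[CHUNK]; };
struct queue {
    struct chunk *slot[SLOTS];
    int head, count;
    int done;    /* the reader has no more chunks to add */
    int differ;  /* a mismatch was found */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

/* Worker: take chunk pairs off the queue and compare them. */
static void *worker(void *arg)
{
    struct queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        if (q->count == 0) {  /* queue drained and reader finished */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        struct chunk *c = q->slot[q->head];
        q->head = (q->head + 1) % SLOTS;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        int same = memcmp(c->a, c->b, c->len) == 0;  /* the CPU-bound part */
        free(c);
        if (!same) {
            pthread_mutex_lock(&q->lock);
            q->differ = 1;
            pthread_cond_signal(&q->not_full);  /* wake the reader if blocked */
            pthread_mutex_unlock(&q->lock);
        }
    }
}

/* Returns 1 if the files are identical, 0 if they differ, -1 on error. */
int queued_files_equal(const char *path1, const char *path2)
{
    FILE *f1 = fopen(path1, "rb"), *f2 = fopen(path2, "rb");
    struct queue q;
    pthread_t tid[WORKERS];
    int nworkers = 0, error, stop;

    if (!f1 || !f2) {
        if (f1) fclose(f1);
        if (f2) fclose(f2);
        return -1;
    }
    memset(&q, 0, sizeof q);
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    for (int i = 0; i < WORKERS; i++)
        if (pthread_create(&tid[nworkers], NULL, worker, &q) == 0)
            nworkers++;
    stop = error = (nworkers == 0);
    while (!stop) {  /* all disk IO happens here, in one thread */
        struct chunk *c = malloc(sizeof *c);
        if (!c) { error = 1; break; }
        size_t n1 = fread(c->a, 1, CHUNK, f1);
        size_t n2 = fread(c->b, 1, CHUNK, f2);
        c->len = n1;
        pthread_mutex_lock(&q.lock);
        if (n1 != n2) q.differ = 1;  /* file lengths differ */
        while (q.count == SLOTS && !q.differ)
            pthread_cond_wait(&q.not_full, &q.lock);
        if (q.differ || n1 == 0) {   /* mismatch, or end of both files */
            stop = 1;
            free(c);
        } else {
            q.slot[(q.head + q.count) % SLOTS] = c;
            q.count++;
            pthread_cond_signal(&q.not_empty);
        }
        pthread_mutex_unlock(&q.lock);
    }
    if (ferror(f1) || ferror(f2)) error = 1;
    pthread_mutex_lock(&q.lock);
    q.done = 1;  /* let the workers drain the queue and exit */
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
    for (int i = 0; i < nworkers; i++)
        pthread_join(tid[i], NULL);
    fclose(f1);
    fclose(f2);
    return error ? -1 : !q.differ;
}
```

Because a plain `memcmp` is so cheap compared to the reads, the workers here will mostly sit idle; the structure only pays off if the per-chunk work is heavier (hashing, for instance).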

Steve suggested that you can memory-map your files on Unix. That speeds up access to the underlying data a bit by leveraging low-level OS functionality (the same kind used to manage swap files): the OS handles loading the parts of the file you are working on into memory efficiently, as long as the file fits into the available address space. FYI, you can do the same thing on Windows.

Eric J. 2010-04-18 16:38:58

Switching threads doesn't really require a context switch, does it?

Chris Cooper 2010-04-18 16:57:43

I like the queuing idea though.

Chris Cooper 2010-04-18 16:58:35

@Chris: Yes, but switching between threads is much lighter weight than switching between processes. The OS still has to save the previous thread's state and load the CPU registers (e.g. the instruction pointer) for the new thread. The CPU may need to evict some items from cache and load others for the new thread, depending on whether all executing threads can fit their memory requests into the CPU cache. In a worst-case scenario, switching threads might even cause swapping (if the threads have allocated AND access a lot of memory during an execution cycle).

Eric J. 2010-04-18 19:30:43

@Eric: That makes sense. Thanks.

Chris Cooper 2010-04-18 21:45:43

Answer 4

+1 A:

Well, there is the standard memory mapping mmap() function that maps a file to memory. You should be able to do something like

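#include <fcntl.h>      /* open */
#include <stdio.h>      /* printf */
#include <sys/mman.h>   /* mmap, munmap */
#include <unistd.h>     /* lseek, close */

int fd1;
int fd2;
off_t size1;   /* off_t rather than int, so large files don't overflow */
off_t size2;

fd1 = open(name1, O_RDONLY);
size1 = lseek(fd1, 0, SEEK_END);   /* -1 if open() failed */

fd2 = open(name2, O_RDONLY);
size2 = lseek(fd2, 0, SEEK_END);

if ( size1 == 0 && size2 == 0 )
{
   /* mmap() rejects a zero length, so handle two empty files here */
   printf("Equal\n");
}
else if ( size1 > 0 && size1 == size2 )
{
   char * data1 = mmap(0, size1, PROT_READ, MAP_SHARED, fd1, 0);
   char * data2 = mmap(0, size2, PROT_READ, MAP_SHARED, fd2, 0);
   off_t i;

   if ( data1 != MAP_FAILED && data2 != MAP_FAILED )
   {
      /* ...and this is, obviously, where you'd do something more clever */
      for ( i = 0; i < size1 && data1[i] == data2[i]; i++ );

      if ( i == size1 )
          printf("Equal\n");
   }

   /* index instead of advancing the pointers, so they are still valid here */
   if ( data1 != MAP_FAILED ) munmap(data1, size1);
   if ( data2 != MAP_FAILED ) munmap(data2, size2);
}

if ( fd1 >= 0 ) close(fd1);
if ( fd2 >= 0 ) close(fd2);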

Other than that, yes, your solution looks overly complicated ;-) The threaded approach is not necessarily flawed, but you might not see that parallel access improves performance. For SAN drives or ramdisks it might improve performance, for normal spinning platter drives it might impede it. But simpler is usually better, unless you really have a performance issue.

Regarding fseek() vs. other methods, it depends on the operating system you use. Google is your friend here; you can easily find articles at least for Solaris and Linux.

Christoffer 2010-04-18 16:41:49

A good idea; however, this will likely lead to disc thrashing. You're comparing a byte at a time, so the OS will alternate between reading one sector from the first file and one sector from the second file, resulting in lots of disc seeks back and forth for each sector's worth of data.

Adam Rosenfield 2010-04-21 19:18:36
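
One way to reduce the back-and-forth seeking described in the comment above is to read a large block from the first file, then a large block from the second, and compare the two blocks in memory. A minimal sketch, assuming a hypothetical function name `blockwise_files_equal` and a caller-chosen `block_size` (something on the order of a megabyte is a plausible starting point, but that figure is a guess to be measured, not a recommendation from the thread):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compares two files one large block at a time.
 * Returns 1 if the files are identical, 0 if they differ, -1 on error. */
int blockwise_files_equal(const char *path1, const char *path2, size_t block_size)
{
    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    unsigned char *b1 = block_size ? malloc(block_size) : NULL;
    unsigned char *b2 = block_size ? malloc(block_size) : NULL;
    int result = -1;

    if (f1 && f2 && b1 && b2) {
        result = 1;
        while (result == 1) {
            /* One long sequential read per file per iteration. */
            size_t n1 = fread(b1, 1, block_size, f1);
            size_t n2 = fread(b2, 1, block_size, f2);

            if (ferror(f1) || ferror(f2))
                result = -1;
            else if (n1 != n2 || memcmp(b1, b2, n1) != 0)
                result = 0;
            else if (n1 < block_size)
                break;  /* both files ended at the same point */
        }
    }

    if (f1) fclose(f1);
    if (f2) fclose(f2);
    free(b1);
    free(b2);
    return result;
}
```

With a block size much larger than a disk sector, the number of switches between the two files drops roughly in proportion, at the cost of two buffers of that size.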

Yeah, hence the comment "this is, obviously, where you'd do something more clever" :-)

Christoffer 2010-04-22 10:15:27

Answer 5

+4 A:

You could probably simplify your code greatly by using hashes instead of doing a byte-by-byte comparison. Assuming you're not doing anything important, like deleting, an MD5 or similar hash function should be plenty. Boost provides quite a few, and they're usually pretty fast.

if fileA.size == fileB.size
    if fileA.hash() == fileB.hash()
        flag(fileA, fileB, same);

I wouldn't delete files after that comparison, but it's plenty safe to move them to a temporary directory for further review or just build a list of possible duplicates.
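
Since the question is about C, the pseudocode above might look something like the sketch below. Note the substitution: standard C has no MD5, so this uses 64-bit FNV-1a purely as a dependency-free stand-in. It is not a cryptographic hash, which is one more reason to treat a match only as a candidate duplicate. The names `file_fnv1a64` and `probably_same` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* 64-bit FNV-1a over a file's contents (stand-in for MD5, not
 * collision resistant). Stores the hash in *out.
 * Returns 0 on success, -1 on error. */
int file_fnv1a64(const char *path, uint64_t *out)
{
    unsigned char buf[65536];
    uint64_t h = UINT64_C(0xcbf29ce484222325);  /* FNV offset basis */
    size_t n;
    int failed;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= UINT64_C(0x100000001b3);       /* FNV prime */
        }
    }
    failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;
    *out = h;
    return 0;
}

/* Returns 1 if the files are candidate duplicates (same size and
 * same hash), 0 if they are definitely different, -1 on error. */
int probably_same(const char *path_a, const char *path_b)
{
    struct stat sa, sb;
    uint64_t ha, hb;

    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;
    if (sa.st_size != sb.st_size)
        return 0;  /* cheap size check first, no hashing needed */
    if (file_fnv1a64(path_a, &ha) != 0 || file_fnv1a64(path_b, &hb) != 0)
        return -1;
    return ha == hb;
}
```

A result of 1 from `probably_same` would then be the point at which the file pair gets flagged or moved aside for review, as suggested above, rather than deleted.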

peachykeen 2010-04-18 17:28:57

I am using C. Not interested in C++, still thanks for your suggestion.

Andrei Ciobanu 2010-04-18 18:37:22

If hashing would make things easier, there are probably C hash libraries around. I wanna say the GNU C lib has a crypto section, and there are undoubtedly others.

peachykeen 2010-04-18 20:19:31

+1 For the tip of using hashes. I will investigate more on this.

Andrei Ciobanu 2010-04-21 19:13:29

Hashes still have to be computed somewhere... which still involves looking at every byte and performing a calculation on that byte. If you store the hash for later use and have a mechanism (e.g. last updated timestamp) to ensure the hash has not changed, re-comparing the file later would be much faster.

Eric J. 2010-05-10 16:40:53

I said hashes could simplify the code, not necessarily make it faster. You can also read chunks, or the entire file, in your code and feed it to your hash function instead of manual byte-by-byte comparison (I'd certainly prefer that ;) ). If you are re-comparing files, for example a one-to-many comparison, hashing them will be quite a bit faster, though (and you won't need to do simultaneous comparisons for all the files).

peachykeen 2010-05-10 19:33:09
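
For the one-to-many case mentioned in the comment above, one common approach (an assumption here, not something spelled out in the thread) is to compute a size and hash once per file, sort the records, and look for neighbours that match. In this sketch `struct file_sig` and `count_candidate_duplicates` are hypothetical names, and the records are assumed to have been filled in beforehand by whatever hash function is in use.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-file record, filled in once per file. */
struct file_sig {
    const char *path;
    long long   size;
    uint64_t    hash;
};

/* Order by size first, then by hash, so equal files end up adjacent. */
static int cmp_sig(const void *pa, const void *pb)
{
    const struct file_sig *a = pa;
    const struct file_sig *b = pb;

    if (a->size != b->size)
        return a->size < b->size ? -1 : 1;
    if (a->hash != b->hash)
        return a->hash < b->hash ? -1 : 1;
    return 0;
}

/* Sorts sigs in place and returns how many entries have the same
 * size and hash as the entry immediately before them. */
size_t count_candidate_duplicates(struct file_sig *sigs, size_t n)
{
    size_t dups = 0;

    qsort(sigs, n, sizeof *sigs, cmp_sig);
    for (size_t i = 1; i < n; i++) {
        if (cmp_sig(&sigs[i - 1], &sigs[i]) == 0)
            dups++;  /* sigs[i] is a candidate duplicate of sigs[i - 1] */
    }
    return dups;
}
```

After the sort, each file is hashed once and compared only against its neighbours, instead of being compared pairwise against every other file.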

Answer 6

+1 A:

Even if disk access were not the limiting factor (it will be), unless you have a multi-core processor that can hand off different threads to different cores, you would not see a speed-up from going multi-threaded. Basically, you have to compare all N bytes of the file one way or another, and if the threads all execute on the same core, it will take the same amount of time as without threads.

There are some environments that could spread the workload across cores, but even so, the CPU will be able to process so much faster than the data can be pulled in from disk that the disk I/O system will be the limiting factor.

JustJeff 2010-04-18 17:30:09

Answer 7

+2 A:

Before even considering the performance effects of parallel disk reads and thread overhead and such...

Is there any reason to believe that scanning the files in chunks will find the differences any quicker than straight through? Is the data contained in the files predominantly in a certain format, and if so, is the splitting scheme tailored to it? If not, I don't see how scanning the files by skipping over every n bytes (which is all the multithreaded splitting is effectively doing) could offer any improvement over reading the bytes in the order they are on disk.

Think of the two limiting cases -- "splitting" the file into one block, and splitting the file into as many one-byte "blocks" as there are bytes in the file. Will either of those cases be more efficient than the other, or some in-between value? If there is no in-between value that you know you should optimize to, then you know nothing about how the data is stored in the files, so it should make no difference how you scan them.

Even if you set the split to optimize to the disk's performance like block size, you're still going to have to go back to read the next byte, which will likely be at an extremely non-optimal position. And in the end you're going to have to read every single byte in the file, no matter how you split it.

Paul Richter 2010-04-18 18:10:38

Answer 8

A:

I see there are crap ones online that want $30, so I figured I'd look for a C# version I could compile myself: luckily, someone made one using hashing back in 2008.

Code Project I found tonight about this very subject

Michael Adams 2010-05-12 03:40:53
