I want to recurse through several directories and find duplicate files among an arbitrary number of them.
My knee-jerk idea is to have a global hashtable or some other data structure to hold each file I find, then check each subsequent file against that "master" list to determine whether it's a duplicate. Obviously, I don't think this would be very efficient, and the "there's got to be a better way!" alarm keeps ringing in my brain.
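For concreteness, here's a minimal Python sketch of that knee-jerk approach, keying a dict on a content hash. The use of `hashlib`/SHA-256 and chunked reading are my assumptions, not requirements; any stable content fingerprint would do:

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Hash a file's contents in chunks so large files aren't loaded fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(directories):
    """Map content hash -> list of paths; any bucket with more than one path holds duplicates."""
    seen = defaultdict(list)
    for root_dir in directories:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                seen[file_digest(path)].append(path)
    return [paths for paths in seen.values() if len(paths) > 1]

# Hypothetical usage:
# for group in find_duplicates(["/data/a", "/data/b"]):
#     print(group)
```

My worry is that this hashes every single file, even ones that can't possibly have a duplicate (e.g. files with a unique size).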
Any advice on a better way to handle this situation would be appreciated.