Thanks in advance for any assistance. I'm not even sure if this is possible, but I'm trying to find duplicate files by hashing them, so that each hash maps to the list of files that share it.

I have this below:

Dictionary<FileHash, string[]> FindDuplicateFiles(string searchFolder)
{
    Directory.GetFiles(searchFolder, "*.*")
        .Select(f => new
        {
            FileName = f,
            FileHash = Encoding.UTF8.GetString(
                new SHA1Managed().ComputeHash(
                    new FileStream(f, FileMode.OpenOrCreate, FileAccess.Read)))
        })
        .GroupBy(f => f.FileHash)
        .Select(g => new
        {
            FileHash = g.Key,
            Files = g.Select(z => z.FileName).ToList()
        })
        .GroupBy(f => f.FileHash)
        .Select(g => new { FileHash = g.Key, Files = g.Select(z => z.Files).ToArray() });
}

It compiles fine, but I'm just curious whether there's even a way to manipulate the results to return a Dictionary.

Any suggestions, alternatives, critiques would be greatly appreciated.

A: 

Create an extension method on IEnumerable<_> called toDictionary that converts a sequence of key/value pairs to a dictionary. It might throw an exception on duplicate keys.
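
A minimal sketch of such an extension method, assuming the pairs arrive as KeyValuePair values. The class name and the method name ToDictionaryFromPairs are made up for illustration; the name differs from toDictionary only to avoid clashing with the built-in Enumerable.ToDictionary.

```csharp
using System;
using System.Collections.Generic;

public static class KeyValuePairExtensions
{
    // Hypothetical helper: builds a Dictionary from a sequence of key/value pairs.
    // Dictionary.Add throws ArgumentException if the same key appears twice.
    public static Dictionary<TKey, TValue> ToDictionaryFromPairs<TKey, TValue>(
        this IEnumerable<KeyValuePair<TKey, TValue>> pairs)
    {
        var result = new Dictionary<TKey, TValue>();
        foreach (var pair in pairs)
        {
            result.Add(pair.Key, pair.Value);
        }
        return result;
    }
}
```

If silently keeping the last value for a repeated key is preferable, result[pair.Key] = pair.Value would do that instead of throwing.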

Why do you need the second GroupBy?

Stefan
A: 

You can use Enumerable.ToDictionary to collect a LINQ query into a dictionary:

var sha1 = new SHA1Managed();

Dictionary<string, string[]> result =
    Directory
        .EnumerateFiles(searchFolder)
        .GroupBy(file => Convert.ToBase64String(sha1.ComputeHash(...)))
        .ToDictionary(g => g.Key, g => g.ToArray());

Some remarks:

  • Don't assume that a random byte sequence (such as a SHA-1 hash) is a valid UTF-8 string.
  • Consider using Directory.EnumerateFiles instead of Directory.GetFiles.
  • Don't forget to close the FileStream after computing the SHA-1 hash.
  • Afaik it's possible to reuse a SHA1Managed, so you don't need to create a new one for each file.
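
Putting those remarks together, the query might look roughly like the sketch below. It is only one way to do it: the class name DuplicateFinder is made up, SHA1.Create() stands in for new SHA1Managed(), and FileMode.Open is my choice so that nothing gets created by accident.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Sketch: groups the files directly inside searchFolder by the
    // Base64 text of their SHA-1 hash.
    public static Dictionary<string, string[]> FindDuplicateFiles(string searchFolder)
    {
        // One hash object reused for every file. It is not thread-safe,
        // so this only works while the query runs sequentially.
        using (var sha1 = SHA1.Create())
        {
            return Directory
                .EnumerateFiles(searchFolder)
                .GroupBy(file =>
                {
                    // The using block closes the stream once the hash is computed.
                    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read))
                    {
                        return Convert.ToBase64String(sha1.ComputeHash(stream));
                    }
                })
                // ToDictionary runs the query eagerly, before sha1 is disposed.
                .ToDictionary(g => g.Key, g => g.ToArray());
        }
    }
}
```

Note that the result contains every hash, including those with only one file; filtering on the array length afterwards leaves just the duplicates.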
dtb
A: 

There's already an extension method which will do this. Just stick this at the end of your existing query:

.ToDictionary(x => x.FileHash, x => x.Files);

However: using Encoding.UTF8.GetString to convert arbitrary binary data into a string is a really bad idea. Use Convert.ToBase64String instead. The hash is not a UTF-8 encoded string, so don't treat it as one.
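
To illustrate why, here is a small sketch (the class and method names are made up) that checks whether a byte array survives a round trip through each text form. Bytes that are not valid UTF-8 get replaced during decoding, so the original hash cannot be recovered and different hashes can end up as the same string.

```csharp
using System;
using System.Linq;
using System.Text;

public static class HashTextDemo
{
    // True if the bytes come back unchanged after decoding and re-encoding as UTF-8.
    public static bool SurvivesUtf8RoundTrip(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(text).SequenceEqual(data);
    }

    // True if the bytes come back unchanged after a trip through Base64.
    public static bool SurvivesBase64RoundTrip(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text).SequenceEqual(data);
    }
}
```

For example, the single bytes 0xFF and 0xFE are both invalid UTF-8 and both decode to the replacement character U+FFFD, so as dictionary keys they would be indistinguishable. Base64 is reversible for any byte sequence.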

You're also grouping by hash twice, which I suspect isn't really what you want to do.

Alternatively, remove the previous GroupBy calls and use a Lookup instead:

var query = Directory.GetFiles(searchFolder, "*.*")
                     .Select(f => new {
                         FileName = f,
                         FileHash = Convert.ToBase64String(
                             new SHA1Managed().ComputeHash(...))
                        })
                     .ToLookup(x => x.FileHash, x => x.FileName);

That will give you a Lookup<string, string>, which is basically the files grouped by hash.
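
If you still want the Dictionary from the question, or only the hashes that really have duplicates, a lookup converts easily. A rough sketch, with a made-up helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupExtensions
{
    // Hypothetical helper: keeps only the hashes shared by two or more files
    // and materializes them as a Dictionary of hash -> file names.
    public static Dictionary<string, string[]> ToDuplicateDictionary(
        this ILookup<string, string> filesByHash)
    {
        return filesByHash
            .Where(group => group.Count() > 1)
            .ToDictionary(group => group.Key, group => group.ToArray());
    }
}
```

One difference worth knowing: indexing a lookup with a key it doesn't contain returns an empty sequence, whereas a Dictionary indexer throws KeyNotFoundException.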

One further thing to note: I suspect you'll be leaving file streams open with this method. I suggest you write a small separate method to compute the hash of a file based on its name, but making sure you close the stream (with a using statement in the normal way). This will also end up making your query simpler - something along the lines of:

var query = Directory.GetFiles(searchFolder)
                     .ToLookup(x => ComputeHash(x));

It's hard to simplify it much further than that :)
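
For completeness, a ComputeHash along those lines might look like the sketch below. The FileHasher class and GroupFilesByHash are just illustrative names, and SHA1.Create() is used here in place of new SHA1Managed().

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class FileHasher
{
    // Returns the SHA-1 hash of the file's contents as Base64 text.
    // Both using statements make sure the hash object and the stream
    // are disposed, even if reading the file throws.
    public static string ComputeHash(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(sha1.ComputeHash(stream));
        }
    }

    // Groups the files directly inside searchFolder by their hash.
    public static ILookup<string, string> GroupFilesByHash(string searchFolder)
    {
        return Directory.GetFiles(searchFolder)
                        .ToLookup(path => ComputeHash(path));
    }
}
```

File.OpenRead opens the file read-only and never creates it, unlike FileMode.OpenOrCreate in the original code.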

Jon Skeet
Yes, this seems like a much better approach. A lot cleaner and easier for someone reading it to figure out what I'm trying to do. I think I also need to read up a little bit on hashing algorithm dos and don'ts. Thanks again for the assistance.
Nate Greenwood