views: 246
answers: 5

I'm computing SHA1 checksums for 15,000 images (40KB - 1.0MB each, approximately 1.8GB total). I'd like to speed this up, as it's going to be a key operation in my program, and right now it takes 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I wrapped the stream in a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I've got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as this alone took ~480 seconds.
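The read-only timing pass amounted to something like this (a sketch; `imagePaths` is a hypothetical stand-in for however the 15,000 file paths are collected):

using System;
using System.Diagnostics;
using System.IO;

static class ReadOnlyBaseline
{
    // IO-only baseline: read every file, discard the bytes, and time the whole pass.
    public static void Run(string[] imagePaths)   // imagePaths: hypothetical list of the image paths
    {
        var sw = Stopwatch.StartNew();
        foreach (string filePath in imagePaths)
        {
            byte[] ignored = File.ReadAllBytes(filePath);
        }
        sw.Stop();
        Console.WriteLine("Read-only pass: {0:F0} seconds", sw.Elapsed.TotalSeconds);
    }
}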

+4  A: 

You are creating and destroying a SHA1Managed instance for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}

Hash classes can be particularly slow to create and destroy; the *CryptoServiceProvider variants, for example, p/invoke into the native Win32 CryptoAPI during setup.

-Oisin

x0n
Wow, darn simple! I wish I'd seen that.
Hamish Grubijan
I have changed this, and it didn't affect the final performance at all. I don't think this was the bottleneck. Thanks for the tip though.
Firesteel
I refuse to believe it did not affect the performance at all - have you got any metrics to back this up? It might not be a large increase, but it should be something.
x0n
@x0n, I'm not sure how to show you. I could post the screenshot of the stopwatch times before and after the change. They are nearly identical.
Firesteel
I'm sure it would make a difference if I wasn't IO bound, but I'm starting to think that is the case.
Firesteel
OK, I'll take your word for it. What's the timing for just reading each file into a byte array and discarding it (without computing any hashes)?
x0n
@x0n, good question. I'll run that now.
Firesteel
I'm not surprised it doesn't make a difference. The cost of the hashing is the huge part, though for small files the instantiation cost might be significant.
GregS
+1  A: 

Use a "ramdisk" - build a file system in memory.

Hamish Grubijan
+1  A: 

You didn't say whether your operation is CPU bound, or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last two years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. If it's a dual-core machine, roughly 2x; if it's a dual-core, 2-CPU machine (a typical server), roughly 4x throughput.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
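A minimal sketch of the parallel approach, assuming .NET 4's Parallel.ForEach and ConcurrentDictionary are available (one SHA1Managed per worker thread, since HashAlgorithm instances are not thread-safe):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHasher
{
    public static IDictionary<string, string> GetChecksums(IEnumerable<string> filePaths)
    {
        var checksums = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(
            filePaths,
            () => new SHA1Managed(),                    // one SHA1 instance per worker thread
            (filePath, loopState, sha1) =>
            {
                using (var fs = File.OpenRead(filePath))
                {
                    checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
                }
                return sha1;                            // hand the instance back for the next file
            },
            sha1 => sha1.Dispose());                    // clean up each thread's instance at the end

        return checksums;
    }
}

Parallel.ForEach handles the work distribution, so there is no hand-rolled bookkeeping of which thread owns which file.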

Cheeso
+1  A: 

Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/

deerchao
A: 

Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        // 128KB buffer; FileShare.ReadWrite avoids failing on files already open elsewhere
        using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
        {
            hash = sha1.ComputeHash(fs);
        }
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. the WinAPI FindFirstFile()/FindNextFile() functions) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go. This is often much slower than enumerating files directory by directory using the WinAPI functions.
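A rough sketch of what that enumeration could look like via P/Invoke (an illustration of the FindFirstFile/FindNextFile approach, not production code - error handling is omitted):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

public static class NativeFileEnumerator
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Yields file paths one at a time instead of building the whole array up front.
    public static IEnumerable<string> EnumerateFiles(string directory, string pattern)
    {
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(directory, pattern), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;

        try
        {
            do
            {
                if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                    yield return Path.Combine(directory, findData.cFileName);
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
    }
}

(On .NET 4, Directory.EnumerateFiles gives similar lazy behaviour without the P/Invoke.)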

Ash