views: 246
answers: 5

I'm computing SHA1 checksums for 15,000 images (40KB - 1.0MB each, approximately 1.8GB total). I'd like to speed this up, as it's going to be a key operation in my program, and right now it takes 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I wrapped the stream in a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I've got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as this alone took ~480 seconds.
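The read-only timing pass amounted to something like this (a sketch; `imagePaths` is a hypothetical stand-in for however the 15,000 file paths are collected):

using System;
using System.Diagnostics;
using System.IO;

static class ReadOnlyBaseline
{
    // IO-only baseline: read every file, discard the bytes, and time the whole pass.
    public static void Run(string[] imagePaths)   // imagePaths: hypothetical list of the image paths
    {
        var sw = Stopwatch.StartNew();
        foreach (string filePath in imagePaths)
        {
            byte[] ignored = File.ReadAllBytes(filePath);
        }
        sw.Stop();
        Console.WriteLine("Read-only pass: {0:F0} seconds", sw.Elapsed.TotalSeconds);
    }
}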

+4  A: 

You are creating and destroying a SHA1Managed instance for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}

Hash classes can be particularly slow to create and destroy; the *CryptoServiceProvider variants, for example, p/invoke into the native Win32 CryptoAPI during setup.

-Oisin

x0n
Wow, darn simple! I wish I'd seen that.
Hamish Grubijan
I have changed this, and it didn't affect the final performance at all. I don't think this was the bottleneck. Thanks for the tip though.
Firesteel
I refuse to believe it did not affect the performance at all - have you got any metrics to back this up? It might not be a large increase, but it should be something.
x0n
@x0n, I'm not sure how to show you. I could post the screenshot of the stopwatch times before and after the change. They are nearly identical.
Firesteel
I'm sure it would make a difference if I wasn't IO bound, but I'm starting to think that is the case.
Firesteel
OK, I'll take your word for it. What's the timing for just reading each file into a byte array and discarding it (without computing any hashes)?
x0n
@x0n, good question. I'll run that now.
Firesteel
I'm not surprised it doesn't make a difference. The cost of the hashing is the huge part, though for small files the instantiation cost might be significant.
GregS
+1  A: 

Use a "ramdisk" - build a file system in memory.

Hamish Grubijan
+1  A: 

You didn't say whether your operation is CPU bound, or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last two years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. If it's a dual-core machine, roughly 2x; if it's a dual-core, 2-CPU machine (a typical server), roughly 4x throughput.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
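A minimal sketch of the parallel approach, assuming .NET 4's Parallel.ForEach and ConcurrentDictionary are available (one SHA1Managed per worker thread, since HashAlgorithm instances are not thread-safe):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHasher
{
    public static IDictionary<string, string> GetChecksums(IEnumerable<string> filePaths)
    {
        var checksums = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(
            filePaths,
            () => new SHA1Managed(),                    // one SHA1 instance per worker thread
            (filePath, loopState, sha1) =>
            {
                using (var fs = File.OpenRead(filePath))
                {
                    checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
                }
                return sha1;                            // hand the instance back for the next file
            },
            sha1 => sha1.Dispose());                    // clean up each thread's instance at the end

        return checksums;
    }
}

Parallel.ForEach handles the work distribution, so there is no hand-rolled bookkeeping of which thread owns which file.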

Cheeso
+1  A: 

Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/

deerchao
A: 

Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        // 128KB buffer; FileShare.ReadWrite avoids failing on files already open elsewhere
        using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
        {
            hash = sha1.ComputeHash(fs);
        }
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. the WinAPI FindFirstFile()/FindNextFile() functions) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go. This is often much slower than enumerating files directory by directory using the WinAPI functions.
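A rough sketch of what that enumeration could look like via P/Invoke (an illustration of the FindFirstFile/FindNextFile approach, not production code - error handling is omitted):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

public static class NativeFileEnumerator
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Yields file paths one at a time instead of building the whole array up front.
    public static IEnumerable<string> EnumerateFiles(string directory, string pattern)
    {
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(directory, pattern), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;

        try
        {
            do
            {
                if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                    yield return Path.Combine(directory, findData.cFileName);
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
    }
}

(On .NET 4, Directory.EnumerateFiles gives similar lazy behaviour without the P/Invoke.)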

Ash