views: 92
answers: 4
I'm using the following code to compute a checksum of a file, which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?

    FileStream fs = new FileStream(txtFile.Text, FileMode.Open);
    string formatted = string.Empty;
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        // ComputeHash reads the stream to the end before returning the digest.
        byte[] hash = sha1.ComputeHash(fs);

        foreach (byte b in hash)
        {
            formatted += b.ToString("X2");
        }
    }
    fs.Close();

Update:

System:

OS: Win 7 64-bit, CPU: i5 750, RAM: 4 GB, HDD: 7200 rpm

Tests:

Test 1 ("quite slow") = 59.895 seconds

Test 2 = 59.94 seconds

+1  A: 

Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.

It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)

How slow is "quite slow" anyway? Compared with, say, copying the file?

If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
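
A minimal sketch of both suggestions (the exact FileStream parameters Jon has in mind aren't stated; FileOptions.SequentialScan and the Stopwatch timing are my assumptions):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;

    class HashTiming
    {
        static void Main(string[] args)
        {
            string path = args[0]; // file to hash

            Stopwatch sw = Stopwatch.StartNew();
            // Hint to the OS that the file will be read sequentially;
            // the buffer size is left at the 4096-byte default here.
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, 4096,
                                                  FileOptions.SequentialScan))
            using (SHA1Managed sha1 = new SHA1Managed())
            {
                byte[] hash = sha1.ComputeHash(fs);
                Console.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
            }
            sw.Stop();
            Console.WriteLine("Hashed in {0:F3} s", sw.Elapsed.TotalSeconds);
        }
    }

Running it twice in a row also shows the effect of the file system cache on the second pass.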

Jon Skeet
I've run some speed tests. Check my update. Also CPU usage only peaks at about 30%.
Bruce Adams
@Bruce: 30% in total? Out of how many cores? If it's a multi-core CPU but a single-threaded hashing algorithm, it could still be CPU-bound. Look at Task Manager's performance tab to see whether one CPU is pegged for the entire time :)
Jon Skeet
No, all 4 cores average at about 5 - 6%. 2 cores doing a bit of work but nowhere near pegged. Definitely IO-bound right?
Bruce Adams
@Bruce: Sounds that way, yes.
Jon Skeet
+1  A: 

First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5, at about 100 MB/s (depending on the CPU), so 2 GB would take about 20 seconds to hash. Also note that if you're using a slow HDD, this might be your real bottleneck, as 30-70 MB/s isn't unusual.

To speed things up, you could hash not the whole file but only the first X KB, or representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause duplicates.
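
A rough sketch of that idea, assuming a hypothetical Sha1OfPrefix helper and an arbitrary prefix size:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class PartialHash
    {
        // Hashes only the first maxBytes of the file as a cheap fingerprint.
        // Note: this is no longer a hash of the whole file's contents.
        static string Sha1OfPrefix(string path, int maxBytes)
        {
            byte[] prefix = new byte[maxBytes];
            int total = 0;
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                int read;
                // Read may return fewer bytes than requested, so loop until done.
                while (total < maxBytes &&
                       (read = fs.Read(prefix, total, maxBytes - total)) > 0)
                {
                    total += read;
                }
            }
            using (SHA1Managed sha1 = new SHA1Managed())
            {
                byte[] hash = sha1.ComputeHash(prefix, 0, total);
                return BitConverter.ToString(hash).Replace("-", "");
            }
        }
    }

For example, Sha1OfPrefix(txtFile.Text, 1024 * 1024) fingerprints just the first megabyte.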

schnaader
+2  A: 

The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it from being a good hash, nor from being strong enough for some uses), is likely to perform better. You could also make your own hash by reading only a subset of the data. (I'd advise making this subset work in 4096-byte chunks of the underlying file, as that matches the buffer size used by SHA1Managed and allows faster chunk reads than you'd get by reading, say, every X bytes for some value of X.)

Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.

I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (which is the default, incidentally), as SHA1Managed will ask for 4096 bytes at a time; this way there's no case where FileStream either returns less than was asked for (allowed, but sometimes suboptimal) or has to do more than one copy at a time.
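
As a sketch of that experiment (the buffer sizes are just guesses to try; time each run and compare):

    using System.IO;
    using System.Security.Cryptography;

    class BufferExperiment
    {
        // Try whole multiples of 4096 for bufferSize, e.g. 8 * 4096 or 256 * 4096,
        // and measure the elapsed time for each.
        static byte[] HashWithBuffer(string path, int bufferSize)
        {
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, bufferSize))
            using (SHA1Managed sha1 = new SHA1Managed())
            {
                return sha1.ComputeHash(fs);
            }
        }
    }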

Jon Hanna
+1 for the first sentence. Sometimes we are solving the wrong problem altogether.
Petar Repac
Thanks. Decided to go with MD5, as I was only checking the integrity of the files after transmission and didn't require the extra security of SHA-1. Out of curiosity: I found Intel's new implementation of SHA-1 using SSE3 instructions (http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/). Just wondering if and how this can be used in managed code?
Bruce Adams
+1  A: 

First: SHA-1 file hashing should be I/O-bound on non-ancient CPUs, and an i5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.

Next, 60 sec for 2 GB of data is ~34 MB/s; that's slow for hard-disk reads - even a 2.5" laptop disk can read faster than that. Assuming the hard drive is internal (no USB2/whatever or network bottleneck) and there's not a lot of other disk I/O activity, I'd be surprised to see less than 60 MB/s read from a modern drive.

My guess would be that ComputeHash() uses a small buffer internally. Try manually reading/hashing, so you can specify a larger buffer (64 KB or even larger) to increase throughput. You could also move to async processing, so disk reads and hash computation can be overlapped.
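
A sketch of the manual read/hash loop, using a 64 KB buffer and the incremental TransformBlock/TransformFinalBlock API (the buffer size is just a starting point to tune):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class ManualHash
    {
        static byte[] Sha1Chunked(string path)
        {
            const int bufferSize = 64 * 1024; // 64 KB reads instead of the default 4 KB
            byte[] buffer = new byte[bufferSize];
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            using (SHA1Managed sha1 = new SHA1Managed())
            {
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Feed each chunk to the hash as it is read.
                    sha1.TransformBlock(buffer, 0, read, null, 0);
                }
                sha1.TransformFinalBlock(buffer, 0, 0); // finalize with an empty block
                return sha1.Hash;
            }
        }
    }

Overlapping the reads and the hashing (e.g. double-buffering with async reads) would be a further step this sketch doesn't attempt.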

snemarch