views: 1146
answers: 4

Is there any Linux command-line implementation that performs exceptionally well for generating SHA-1 hashes of large files (< 2GB)?

I have played around with 'openssl sha1' and it takes minutes to get the sha1 for a 2GB file :/.

+1  A: 

sha1sum is what I'd use for computing SHA-1 checksums... it's designed to do exactly one thing so I would hope it does it as fast as practically possible. I don't have any 2GB files to benchmark it on though :-(
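
For example (a minimal sketch; the file path is just a placeholder):

  # prints "<sha1 digest>  <filename>"; the time prefix reports how long it took
  time sha1sum /path/to/bigfile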

EDIT: After some tests on an ISO image it looks like the limiting factor on my system is disk I/O speed - not surprising, although I feel kind of silly for not thinking of that earlier. Once that's corrected for, it seems like openssl is about twice as fast as sha1sum...
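
A rough sketch of how to factor the disk out (file name illustrative): hash the same file twice with each tool and compare the second, cache-warm runs.

  # first run pulls the file from disk; repeat runs come from the page cache
  time openssl sha1 /path/to/test.iso
  time openssl sha1 /path/to/test.iso
  time sha1sum /path/to/test.iso
  time sha1sum /path/to/test.iso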

David Zaslavsky
+2  A: 

Your problem is likely disk I/O. A basic SHA1 implementation on an old 2.0GHz Core Duo processor can process /dev/zero at 100MiB/s - faster than most hard drives typically paired with such a system.
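
A quick way to measure the CPU-only hashing rate, with no disk in the picture (1 GiB of zeros is just an arbitrary test size):

  # dd reports the throughput of the pipe when it finishes
  dd if=/dev/zero bs=1M count=1024 | sha1sum
  dd if=/dev/zero bs=1M count=1024 | openssl sha1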

Show us the speeds you're currently seeing (and on what spec hardware).

Andrew Medico
I'm doing this on a MacBook Pro, Late 2007 (2.4 GHz Intel Core 2 Duo, 4 GB 667 MHz DDR2). The actual app will run on Amazon's c1.medium EC2 instances (dual core as well). I can use tmpfs for files up to 1.7GB. So until then I'd still like to use the fastest algorithm ;).
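
A sketch of that tmpfs approach (assuming /dev/shm is mounted as tmpfs, as on most Linux distributions; paths are placeholders):

  # copy into RAM-backed tmpfs, hash there, then clean up
  cp /path/to/bigfile /dev/shm/bigfile
  time sha1sum /dev/shm/bigfile
  rm /dev/shm/bigfile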
felixge
Try compiling for 64-bit if you can. The algorithm from GNU coreutils I use in a C application gets 167M/s on a 2.4GHz C2D and 271M/s on a 3.0GHz Xeon in 64-bit mode.
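
A quick sanity check (not a benchmark) of whether the binaries in use are already 64-bit builds:

  uname -m                                    # machine architecture, e.g. x86_64
  file "$(which sha1sum)" "$(which openssl)"  # reports 32-bit vs 64-bit ELF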
Andrew Medico
+4  A: 

On my machine, for a file of 1GB, with enough memory to have the entire file cached in memory after the first run:

sha1sum: 3.92s
openssl sha1: 3.48s
python hashlib.sha1: 3.22s
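
A rough sketch of how such a comparison can be run (test.bin stands in for the 1GB file; exact figures will of course vary by machine):

  time sha1sum test.bin
  time openssl sha1 test.bin
  # the python one-liner reads the whole file into memory, which is fine for 1GB here
  time python -c "import hashlib; print(hashlib.sha1(open('test.bin','rb').read()).hexdigest())"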

it takes minutes to get the sha1 for a 2GB file

There's something wrong there, then, unless you're using incredibly slow old hardware. Even on the first run, where the file was being read directly from disc, it was only taking 'openssl sha1' about 20s per gig on my machine. Are you having slow I/O problems in general?
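
One quick check for general I/O trouble (file name is a placeholder): time a plain sequential read and compare it against the hash runs.

  # reads the file without hashing; dd reports throughput when it completes
  dd if=/path/to/bigfile of=/dev/null bs=1M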

bobince
+1  A: 

I don't think a SHA implementation can be optimized for a particular file size: the algorithm operates on fixed-size blocks, and the computation cannot be done in parallel. It follows that the fastest implementation on a small file will also be the fastest on a large file.
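
A small illustration of the point (assuming a reasonably recent GNU split for the -n option, and a file named bigfile): because SHA-1 chains its state across every fixed-size block in order, the digests of the two halves have no usable relationship to the digest of the whole file.

  split -n 2 bigfile part_         # produces part_aa and part_ab
  sha1sum part_aa part_ab bigfile  # the two part digests cannot be combined into the full digest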

erickson