Hi everyone. I will try my best to explain my problem and my line of thought on how I think I can solve it.

I use this code:

    import os
    import hashlib

    for root, dirs, files in os.walk(downloaddir):
        for infile in files:
            f = open(os.path.join(root, infile), 'rb')
            filehash = hashlib.md5()
            # Read in 10 KB chunks so even huge files never have to
            # be loaded into memory all at once.
            while True:
                data = f.read(10240)
                if len(data) == 0:
                    break
                filehash.update(data)
            f.close()
            print "FILENAME: ", infile
            print "FILE HASH: ", filehash.hexdigest()

Using start = time.time() and elapsed = time.time() - start, I measure how long it takes to calculate a hash. Pointing my code at a 653 MB file, this is the result:

    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.624
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.373
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.540
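
For reference, the timing is just start/elapsed wrapped around the loop above; a minimal sketch:

    import time

    start = time.time()
    # ... the os.walk/hashing loop from above runs here ...
    elapsed = time.time() - start
    print "          %.3f" % elapsed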

OK, about 12 seconds on a 653 MB file. My problem is that I intend to use this code in a program that will run through multiple files, some of which might be 4-6 GB, so it will take way longer to calculate. What I am wondering is whether there is a faster way to calculate the hash of a file, maybe by doing some multithreading? I used another script to check CPU usage second by second, and I can see that my code only uses one of my two CPUs, and only at 25% max. Is there any way I can change this?

Thank you all in advance for your help.

+3  A: 

Hash calculation in your case will almost certainly be I/O bound (unless you run it on a machine with a really slow processor), so multithreading, or processing multiple files at once, probably won't yield the results you expect.

Arranging the files over multiple drives, or putting them on a faster (SSD) drive, would probably help, even though that is probably not the solution you are looking for.
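
One quick way to confirm the I/O bound is to time a plain chunked read of the same file, with no hashing at all, and compare it to your 12 seconds; a minimal sketch (the path is a placeholder):

    import time

    def read_only(path, bufsize=10240):
        # Read the file in chunks, discarding the data, to time raw disk I/O.
        f = open(path, 'rb')
        while f.read(bufsize):
            pass
        f.close()

    start = time.time()
    read_only('freebsd.iso')
    print "read-only time: %.3f" % (time.time() - start)

If the read-only time is close to the hashing time, the disk is the limit and a faster hashing scheme won't buy you much.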

Mavrik
+2  A: 

Aren't disk operations the bottleneck here? Assuming a read speed of 80 MB/s (this is how my hard disk performs), it takes about 8 seconds just to read the file (653 MB / 80 MB/s is roughly 8.2 s).

Grzegorz Oledzki
+2  A: 

For what it's worth, doing this:

    c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso

takes 9.671 seconds on my laptop (2 GHz Core 2 Duo with an 80 GB SATA laptop hard drive).

As others have mentioned, MD5 computation is disk-bound, so your 12-second benchmark is probably pretty close to the fastest you could get.

Also, Python's md5sum.py uses 8096 for the buffer size (even though I'm sure they meant either 4096 or 8192).

Seth
A: 

It helped me to increase my buffer size, up to a point. I started with 1024 and multiplied it by 2^N, increasing N each time, starting from N = 1. With this method, I found that on my system a buffer size of 65536 seemed to be about as good as it gets. However, it only gave me about a 7% improvement in running time.
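
A rough sketch of that kind of sweep (the file path is a placeholder; the numbers will vary from system to system):

    import time
    import hashlib

    def md5_time(path, bufsize):
        # Hash the whole file with the given buffer size; return elapsed seconds.
        start = time.time()
        md5 = hashlib.md5()
        f = open(path, 'rb')
        while True:
            data = f.read(bufsize)
            if not data:
                break
            md5.update(data)
        f.close()
        return time.time() - start

    for n in range(1, 11):                  # buffer sizes from 2 KB up to 1 MB
        bufsize = 1024 * 2 ** n
        print "%8d bytes: %.3f s" % (bufsize, md5_time('freebsd.iso', bufsize))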

Profiling indicated that about 80% of the time is spent in the MD5 update method and the other 20% in reading the file. Since MD5 is a serial algorithm and Python's implementation is already written in C, I don't think there is much you can do to speed up the MD5 part. You can try calculating the MD5s of two different files in parallel, as sketched below, but as everyone has said, you're ultimately going to be limited by the disk access speed.
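
A sketch of that parallel variant using multiprocessing (the file list and pool size are placeholders; on a single disk the extra seeking may well make this slower, not faster):

    import hashlib
    from multiprocessing import Pool

    def md5_file(path, bufsize=65536):
        # Return (path, hex digest) for one file.
        md5 = hashlib.md5()
        f = open(path, 'rb')
        while True:
            data = f.read(bufsize)
            if not data:
                break
            md5.update(data)
        f.close()
        return path, md5.hexdigest()

    if __name__ == '__main__':
        paths = ['file1.iso', 'file2.iso']   # placeholder file list
        pool = Pool(processes=2)             # one worker per CPU
        for path, digest in pool.map(md5_file, paths):
            print "%s  %s" % (path, digest)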

Justin Peel