Hi everyone. I will try my best to explain my problem and my line of thought on how I think I can solve it.

I use this code:

    import os
    import hashlib

    for root, dirs, files in os.walk(downloaddir):
        for infile in files:
            f = open(os.path.join(root, infile), 'rb')
            filehash = hashlib.md5()
            # Read in 10 KB chunks so even huge files never have to
            # be loaded into memory all at once.
            while True:
                data = f.read(10240)
                if len(data) == 0:
                    break
                filehash.update(data)
            f.close()
            print "FILENAME: ", infile
            print "FILE HASH: ", filehash.hexdigest()

Using start = time.time() and elapsed = time.time() - start, I measure how long it takes to calculate a hash. Pointing my code at a 653 MB file, this is the result:

    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.624
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.373
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.540
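
For reference, the timing is just start/elapsed wrapped around the loop above; a minimal sketch:

    import time

    start = time.time()
    # ... the os.walk/hashing loop from above runs here ...
    elapsed = time.time() - start
    print "          %.3f" % elapsed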

OK, about 12 seconds on a 653 MB file. My problem is that I intend to use this code in a program that will run through multiple files, some of which might be 4-6 GB, so it will take way longer to calculate. What I am wondering is whether there is a faster way to calculate the hash of a file, maybe by doing some multithreading? I used another script to check CPU usage second by second, and I can see that my code only uses one of my two CPUs, and only at 25% max. Is there any way I can change this?

Thank you all in advance for your help.

+3  A: 

Hash calculation in your case will almost certainly be I/O bound (unless you run it on a machine with a really slow processor), so multithreading, or processing multiple files at once, probably won't yield the results you expect.

Arranging the files over multiple drives, or putting them on a faster (SSD) drive, would probably help, even though that is probably not the solution you are looking for.
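
One quick way to confirm the I/O bound is to time a plain chunked read of the same file, with no hashing at all, and compare it to your 12 seconds; a minimal sketch (the path is a placeholder):

    import time

    def read_only(path, bufsize=10240):
        # Read the file in chunks, discarding the data, to time raw disk I/O.
        f = open(path, 'rb')
        while f.read(bufsize):
            pass
        f.close()

    start = time.time()
    read_only('freebsd.iso')
    print "read-only time: %.3f" % (time.time() - start)

If the read-only time is close to the hashing time, the disk is the limit and a faster hashing scheme won't buy you much.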

Mavrik
+2  A: 

Aren't disk operations the bottleneck here? Assuming a read speed of 80 MB/s (this is how my hard disk performs), it takes about 8 seconds just to read the file (653 MB / 80 MB/s is roughly 8.2 s).

Grzegorz Oledzki
+2  A: 

For what it's worth, doing this:

    c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso

takes 9.671 seconds on my laptop (2 GHz Core 2 Duo with an 80 GB SATA laptop hard drive).

As others have mentioned, MD5 computation is disk-bound, so your 12-second benchmark is probably pretty close to the fastest you could get.

Also, Python's md5sum.py uses 8096 for the buffer size (even though I'm sure they meant either 4096 or 8192).

Seth
A: 

It helped me to increase my buffer size, up to a point. I started with 1024 and multiplied it by 2^N, increasing N each time, starting from N = 1. With this method, I found that on my system a buffer size of 65536 seemed to be about as good as it gets. However, it only gave me about a 7% improvement in running time.
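
A rough sketch of that kind of sweep (the file path is a placeholder; the numbers will vary from system to system):

    import time
    import hashlib

    def md5_time(path, bufsize):
        # Hash the whole file with the given buffer size; return elapsed seconds.
        start = time.time()
        md5 = hashlib.md5()
        f = open(path, 'rb')
        while True:
            data = f.read(bufsize)
            if not data:
                break
            md5.update(data)
        f.close()
        return time.time() - start

    for n in range(1, 11):                  # buffer sizes from 2 KB up to 1 MB
        bufsize = 1024 * 2 ** n
        print "%8d bytes: %.3f s" % (bufsize, md5_time('freebsd.iso', bufsize))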

Profiling indicated that about 80% of the time is spent in the MD5 update method and the other 20% in reading the file. Since MD5 is a serial algorithm and Python's implementation is already written in C, I don't think there is much you can do to speed up the MD5 part. You can try calculating the MD5s of two different files in parallel, as sketched below, but as everyone has said, you're ultimately going to be limited by the disk access speed.
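
A sketch of that parallel variant using multiprocessing (the file list and pool size are placeholders; on a single disk the extra seeking may well make this slower, not faster):

    import hashlib
    from multiprocessing import Pool

    def md5_file(path, bufsize=65536):
        # Return (path, hex digest) for one file.
        md5 = hashlib.md5()
        f = open(path, 'rb')
        while True:
            data = f.read(bufsize)
            if not data:
                break
            md5.update(data)
        f.close()
        return path, md5.hexdigest()

    if __name__ == '__main__':
        paths = ['file1.iso', 'file2.iso']   # placeholder file list
        pool = Pool(processes=2)             # one worker per CPU
        for path, digest in pool.map(md5_file, paths):
            print "%s  %s" % (path, digest)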

Justin Peel