I have used hashlib (which replaces the md5 module in Python 2.6/3.0), and it worked fine when I opened a file and passed its contents to the hashlib.md5() function.

The problem is with very big files whose size could exceed the available RAM.

How can I get the MD5 hash of a file without opening it?

+2  A: 

You can't get its MD5 without reading the full content, but you can use the update() function to feed in the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b).
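
A quick way to check that equivalence, as a minimal sketch: the byte strings a and b below are made-up sample data standing in for two consecutive blocks of a file.

```python
import hashlib

# Hypothetical sample data standing in for two consecutive blocks of a file.
a = b"first block of the file, "
b = b"second block of the file"

# Feed the blocks to the hash object one at a time.
incremental = hashlib.md5()
incremental.update(a)
incremental.update(b)

# Hash the concatenation in a single call.
one_shot = hashlib.md5(a + b)

same_digest = incremental.hexdigest() == one_shot.hexdigest()
print(same_digest)  # True
```

Because the hash object keeps its own internal state between calls, only the current block has to be in memory.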

sunqiang
Thank you for help.
JustRegisterMe
+25  A: 

Break the file into 128-byte chunks and feed them to MD5 consecutively using update().

This works because MD5 consumes its input incrementally in 64-byte (512-bit) blocks, so 128 bytes is exactly two blocks (128 bits is the size of the digest, not of the block). Basically, when MD5 digest()s the file, this is exactly what it is doing.

If you make sure you free the memory on each iteration (i.e. do not read the entire file into memory), this needs no more than about 128 bytes of buffer memory at a time.

One example is to read the chunks like so:

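import hashlib
m = hashlib.md5()
with open(fileName, 'rb') as f:  # binary mode, so bytes are hashed as stored
    # f.read(128) returns b'' at end of file, which stops the iteration
    for chunk in iter(lambda: f.read(128), b''):
        m.update(chunk)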
Yuval A
Thanks very much, that worked like a charm!
JustRegisterMe
Python is garbage-collected, so there's (usually) not really a need to worry about memory. Unless you explicitly keep references to all the strings you read from the file, Python will free and/or reuse the memory as it sees fit.
Kjetil Jorgensen
@kjeitikor: If you read the entire file into e.g. a Python string, then Python won't have much of a choice. That's why "worrying" about memory makes total sense in this case, where the choice to read it in chunks must be made by the programmer.
unwind
You can just as effectively use a block size of any multiple of 128 (say 8192, 32768, etc.) and that will be much faster than reading 128 bytes at a time.
jmanning2k
Thanks jmanning2k for this important note. A test on a 184MB file takes (0m9.230s, 0m2.547s, 0m2.429s) using block sizes of (128, 8192, 32768). I will use 8192, as the higher value makes no noticeable difference.
JustRegisterMe
+20  A: 

You need to read the file in chunks of suitable size:

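import hashlib

def md5_for_file(f, block_size=2**20):
    # f is a file object opened in binary mode, e.g. open(path, 'rb')
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)  # at most block_size bytes in memory at once
        if not data:               # empty result means end of file
            break
        md5.update(data)
    return md5.digest()  # raw 16-byte digest; use hexdigest() for a hex string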
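
A usage sketch along the same lines, under some assumptions: the helper name md5_hex_for_path, the temporary test file, and the 8192-byte block size are all chosen here for illustration, and hexdigest() is used instead of digest() to get a printable string.

```python
import hashlib
import os
import tempfile

def md5_hex_for_path(path, block_size=2**20):
    """Hypothetical helper: hash the file at `path` one block at a time."""
    md5 = hashlib.md5()
    # Binary mode, so the bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        while True:
            data = f.read(block_size)
            if not data:  # empty read means end of file
                break
            md5.update(data)
    return md5.hexdigest()

# Throwaway file spanning several blocks, to exercise the loop.
payload = b"x" * 100000
fd, tmp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    chunked = md5_hex_for_path(tmp_path, block_size=8192)
finally:
    os.remove(tmp_path)

# The chunked result matches hashing the whole payload at once.
print(chunked == hashlib.md5(payload).hexdigest())  # True
```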
Lars Wirzenius
Thanks for this example.
JustRegisterMe
A: 

The only way I know of to calculate an MD5 for a file without opening it is to use this API: http://www.filemd5.net

JohnnieWalker