Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).
+1
A:
Here is a simple way, though it's pretty memory inefficient:
import hashlib
[(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]
This gives you a list of tuples, each containing a file's name and its hash.
I strongly question your use of MD5. You should be using at least SHA1. MD5 is known to be broken and shouldn't be used for any purpose, even if you don't think your purpose is security sensitive.
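If you do want a stronger hash, the one-liner above only needs hashlib.sha1 (or hashlib.sha256) swapped in for hashlib.md5. A minimal sketch, assuming the same fnamelst list of paths:
import hashlib
# Same whole-file approach as above, just with SHA-1 and a hex string result.
[(fname, hashlib.sha1(open(fname, 'rb').read()).hexdigest()) for fname in fnamelst]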
Here is a way that is more complex, but memory efficient:
import hashlib

def hashfile(afile, hasher, blocksize=65536):
    # Feed the file to the hasher one block at a time so the whole
    # file never has to be held in memory at once.
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    return hasher.digest()

[(fname, hashfile(open(fname, 'rb'), hashlib.md5())) for fname in fnamelst]
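If you also want each file handle closed promptly, a small wrapper around the hashfile() above does it. This is only a sketch; the helper name hash_path is mine, not part of the original answer:
import hashlib

def hash_path(fname):
    # Open in binary mode so the hashed bytes match the file on disk, and
    # let the with-block close the handle as soon as hashing is done.
    with open(fname, 'rb') as afile:
        return hashfile(afile, hashlib.md5())

[(fname, hash_path(fname)) for fname in fnamelst]
If you want a printable checksum rather than raw bytes, have hashfile() return hasher.hexdigest() instead of hasher.digest().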
Omnifarious
2010-08-07 19:53:25
I'm only using MD5 to confirm the file isn't corrupted. I'm not so concerned about it being broken.
Alexander
2010-08-07 20:03:30
@TheLifelessOne: And despite @Omnifarious' scary warnings, that is a perfectly good use of MD5.
GregS
2010-08-07 20:09:02
@GregS, @TheLifelessOne - Yeah, and next thing you know someone finds a way to use this fact about your application to cause a file to be accepted as uncorrupted when it isn't the file you're expecting at all. No, I stand by my scary warnings. I think MD5 should be removed or come with deprecation warnings.
Omnifarious
2010-08-07 20:21:32
+4
A:
You can use hashlib.md5().
Note that sometimes you won't be able to fit the whole file into memory. In that case, you'll have to read it in chunks (e.g., 128 bytes at a time) and feed them to the MD5 object's update() method. See this question.
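A minimal sketch of that chunked approach (the 8192-byte block size and the file name 'somefile' are just illustrative choices):
import hashlib

md5 = hashlib.md5()
with open('somefile', 'rb') as f:
    # Hash the file one block at a time so large files never need to fit in memory.
    for chunk in iter(lambda: f.read(8192), b''):
        md5.update(chunk)
print(md5.hexdigest())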
quantumSoup
2010-08-07 19:53:52
Well, if any of the files are larger than 1 MB, then I've got some problems. Thanks though; I think that solves my problem.
Alexander
2010-08-07 19:59:09