hi,

i need to transfer large files across a network and need to create checksums for them on an hourly basis, so the speed of generating the checksums is critical for me.

somehow i can't make zlib.crc32 and zlib.adler32 work with files larger than 4GB on a Windows XP Pro 64bit machine. i suspect i've hit the 32bit limitation here? using hashlib.md5 i could get a result, but the problem is the speed: it takes roughly 5 minutes to generate an md5 for a 4.8GB file. task manager shows that the process is using only one core.

my questions are:

  1. is there a way to make crc work on large files? i prefer to use crc rather than md5
  2. if not, is there a way to speed up md5.hexdigest()/md5.digest()? or, in this case, any hashlib hexdigest/digest? maybe splitting it into a multi-threaded process? how do i do that?

many thanks guys,

Pietra Arumaga

PS: i'm working on something similar to an "Asset Management" system, kind of like svn, but the assets consist of large compressed image files. the files have tiny incremental changes. the hashing/checksum is needed for detecting changes and for error detection.

A: 

You cannot possibly use more than one core to calculate the MD5 hash of a large file because of the very nature of MD5: it expects a message to be broken up into chunks and fed into the hashing function in strict sequence. However, you can use one thread to read the file into an internal queue, and then calculate the hash in a separate thread. I do not think, though, that this will give you any significant performance boost.

The fact that it takes so long to process a big file might be due to "unbuffered" reads. Try reading, say, 16 Kb at a time and then feed the content in chunks to the hashing function.
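For example, a minimal sketch of that buffered, chunked read (the 16 Kb chunk size is just a starting point to experiment with):

    import hashlib

    def md5_of_file(path, chunk_size=16 * 1024):
        """Read the file in fixed-size chunks and feed each chunk to the hash."""
        h = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()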

Anton Gogolev
thanks for the reply Anton. i use f.read(1048576) and update the hashlib.md5() for every read. yes, i guess creating another thread for calculating the hash won't give that much of a performance boost
pixelblender
A: 

md5 itself can't be run in parallel. However, you can md5 the file in sections (in parallel) and then take an md5 of the list of hashes.
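A rough sketch of that idea (the 64 MB section size and thread count are arbitrary choices to experiment with, and concurrent.futures is just one convenient way to run the sections in parallel):

    import hashlib
    import os
    from concurrent.futures import ThreadPoolExecutor

    def md5_of_section(path, offset, length):
        """md5 one fixed-size section of the file."""
        h = hashlib.md5()
        with open(path, 'rb') as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(1024 * 1024, remaining))
                if not chunk:
                    break
                h.update(chunk)
                remaining -= len(chunk)
        return h.digest()

    def combined_md5(path, section_size=64 * 1024 * 1024, workers=4):
        """md5 each section in parallel, then md5 the concatenated section digests."""
        size = os.path.getsize(path)
        offsets = range(0, size, section_size)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            digests = list(pool.map(lambda off: md5_of_section(path, off, section_size), offsets))
        top = hashlib.md5()
        for d in digests:
            top.update(d)
        return top.hexdigest()

Note that the resulting value is a hash of the section hashes, not the digest md5 would give for the whole file.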

However, that assumes that the hashing is not IO-limited, which I suspect it is. As Anton Gogolev suggests, make sure that you're reading the file efficiently (in large power-of-2 chunks). Once you've done that, make sure the file isn't fragmented.

Also, a hash such as sha256 should be selected rather than md5 for new projects.

Are the zlib checksums much faster than md5 for 4Gb files?

Douglas Leeder
SHA256 would be much slower than MD5, and there is no need for it. Yes, there has been a successful attack to engineer collisions with MD5, but this application is not trying to be cryptographically secure. He's using the hash as an optimization to prevent needless copying.
Stephen C. Steel
thanks for the reply Douglas. i think sha256 is a bit too much for me and collisions are not really a concern for me.
pixelblender
A: 

Did you try the crc-generator module?

Brian
thanks for the link, i'll take a look at it right away.
pixelblender
+2  A: 

It's an algorithm selection problem, rather than a library/language selection problem!

There appear to be two main points to consider:

  • how much would the disk I/O affect the overall performance?
  • what is the expected reliability of the error detection feature?

Apparently, the answer to the second question is something like 'some false negatives allowed', since the reliability of any 32-bit hash, relative to a 4Gb message, is not going to be anywhere near absolute, even in a moderately noisy channel.

Assuming that I/O can be improved through multithreading, we may choose a hash that doesn't require a sequential scan of the complete message. Instead, we could work on the file in parallel, hashing individual sections and either combining the hash values or appending them, to form a longer, more reliable error-detection device.

The next step could be to formalize this handling of files as ordered sections and to transmit them as such (to be re-glued together at the recipient's end). This approach, along with additional information about the way the files are produced (for example, they may be modified exclusively by appending, like log files), may even allow you to limit the amount of hash calculation required. The added complexity of this approach needs to be weighed against the desire to have zippy-fast CRC calculation.

Side note: Adler32 is not limited to message sizes below a particular threshold. It may just be a limit of the zlib API. (BTW, the reference I found about zlib.adler32 used a buffer, and, well... that approach is to be avoided in the context of our huge messages, in favor of streamed processing: read a little from the file, calculate, repeat.)
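To illustrate the streamed approach: both zlib.crc32 and zlib.adler32 accept the previous result as a running value, so the whole file never has to be in memory (the 1 MB chunk size is just an example):

    import zlib

    def crc32_of_file(path, chunk_size=1024 * 1024):
        """Stream the file through zlib.crc32, feeding the running value back in."""
        crc = 0
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                crc = zlib.crc32(chunk, crc)
        return crc & 0xffffffff  # mask to an unsigned 32-bit value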

mjv
hi mjv, thanks for your reply. so i guess i should create checksums on several parts of the file and combine them?
pixelblender
@pixelblender Yes, provided that I/O is not a bottleneck, a multi-threaded implementation which processes, say, 100 MB "slices" of the file in parallel can be expected to be faster overall than a single-threaded approach. You'll need to experiment to determine the optimal number of threads (there always comes a point where adding threads no longer improves performance). The ordered list of CRCs from the individual "slices" of the file can either be CRC-ed itself or, preferably, the CRCs can be appended to form a longer key, offering better error detection.
mjv
+1  A: 

First, there is nothing inherent in any of the CRC algorithms that would prevent them from working on an arbitrary length of data (however, a particular implementation might well impose a limit).

However, in a file syncing application, that probably doesn't matter, as you may not want to hash the entire file when it gets large anyway, just chunks. If you hash the entire file and the hashes at each end differ, you have to copy the entire file. If you hash fixed-size chunks, then you only have to copy the chunks whose hash has changed. If most of the changes to the files are localized (e.g. a database), then this will likely require much less copying (and it's easier to spread per-chunk calculations across multiple cores).
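A rough sketch of that per-chunk scheme (the 4 MB chunk size is an arbitrary choice; a real sync tool would compare the digest lists from both ends and transfer only the chunks that differ):

    import hashlib

    def chunk_digests(path, chunk_size=4 * 1024 * 1024):
        """Return one digest per fixed-size chunk of the file."""
        digests = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                digests.append(hashlib.md5(chunk).hexdigest())
        return digests

    def changed_chunks(old_digests, new_digests):
        """Indices of chunks whose digest differs, or that exist on only one side."""
        count = max(len(old_digests), len(new_digests))
        return [i for i in range(count)
                if i >= len(old_digests) or i >= len(new_digests)
                or old_digests[i] != new_digests[i]]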

As for the hash algorithm itself, the basic tradeoff is speed vs. lack of collisions (two different data chunks yielding the same hash). CRC-32 is fast, but with only 2^32 unique values, collisions may be seen. MD5 is much slower, but has 2^128 unique values, so collisions will almost never be seen (though they are still theoretically possible). The larger hashes (SHA1, SHA256, ...) have even more unique values, but are slower still. I doubt you need them: you're worried about accidental collisions, unlike digital signature applications, where you're worried about deliberately (maliciously) engineered collisions.

It sounds like you're trying to do something very similar to what the rsync utility does. Can you just use rsync?

Stephen C. Steel
hi Stephen, thanks for your reply. yes, collisions are not a concern for me, that's why i prefer to use crc32. i've edited my post regarding what i'm trying to accomplish with the checksum.
pixelblender
Even if you can't find a suitable Python implementation of the CRC32 algorithm, you should be able to adapt an implementation published in any language. You could even take advantage of Python's capabilities to link to native code libraries. This might even help the speed (but your performance is probably limited by disk I/O anyway with CRC-32). The CRC algorithms are fairly simple. I've implemented CRC-8 and CRC-16 in a few lines of C and a static data table. I don't remember implementing CRC-32, but I'm pretty sure it isn't much more complicated.
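For illustration, the usual table-driven form of CRC-32 looks like this in Python (0xEDB88320 is the standard reflected polynomial that zlib uses; a pure-Python loop like this is far slower than zlib's C implementation, so it only shows the structure):

    def _make_crc32_table():
        """Precompute the 256-entry lookup table for the reflected CRC-32 polynomial."""
        table = []
        for n in range(256):
            c = n
            for _ in range(8):
                c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
            table.append(c)
        return table

    _CRC32_TABLE = _make_crc32_table()

    def crc32_update(crc, data):
        """Update a running CRC-32 with more bytes; matches zlib.crc32(data, crc)."""
        reg = crc ^ 0xFFFFFFFF
        for byte in data:
            reg = _CRC32_TABLE[(reg ^ byte) & 0xFF] ^ (reg >> 8)
        return reg ^ 0xFFFFFFFF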
Stephen C. Steel
A: 

You might be hitting a size limit for files in XP. 64-bit Windows gives you more address space (removing the roughly 2GB per-application limit), but it probably does nothing for the file size problem.

Calyth