I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate its creation time / size?
I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate its creation time / size?
I suggest you measure the average time it takes to produce a zip of a certain size. Then you calculate the estimate from that measure. However I think the estimate will be very rough in any case if you don't know how well the data compresses. If the data you want to compress had a very similar "profile" each time you could probably make better predictions.
Extract a bunch of small parts from the big file. Maybe 64 chunks of 64k each. Randomly selected.
Concatenate the data, compress it, measure the time and the compression ratio. Since you've randomly selected parts of the file chances are that you have compressed a representative subset of the data.
Now all you have to do is to estimate the time for the whole file based on the time of your test-data.
If you're using the ZipFile.write() method to write your files into the archive, you could do the following:
This won't work if you're only zipping one really big file though. I've never used the zip module myself, so I'm not sure if it would work, but for small numbers of large files, maybe you could use the ZipFile.writestr() function and read in / zip up your files in chunks?
If its possible to get progress callbacks from the python module i would suggest finding out how many bytes are processed pr second ( By simply storing where in the file you where at start of the second, and where you are at the end ). When you have the data on how fast the computer your on you can off course save it, and use it as a basis for your next zip file. ( I normally collect about 5 samples before showing a time prognosses )
Using this method can give you Microsoft minutes so as you get more samples you would need to average it out. This would esp be the case if your making a zip file that contains a lot of files, as ZIP tends to slow down when compressing many small files compared to 1 large file.