I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing: I try 10,000 alternatives and keep the best one. Because I don't know how files are stored in an ISO 9660 filesystem, I currently add a lot of slop for metadata. I'd like to cut down the slop.
I could use `genisoimage -print-size`, except it is too slow: given 40,000 files occupying 500MB, a single run takes about 3 seconds, and 10,000 runs works out to roughly 8 hours per DVD, which is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate, or can point me to a helpful specification.
Clarifying the problem and the question:
I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.
I'm using the standard greedy packing algorithm: consider the largest file first, and put each file on the first DVD that has sufficient room. So j_random_hacker, I am definitely not starting from random. I start from a sorted order and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than the real capacity.
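To make the search concrete, here is a minimal Python sketch of the loop. It is not my actual code: the perturbation scheme and the p=0.75 parameter are just placeholders, and `capacity` is whatever estimate of usable DVD space I plug in, which is exactly the part this question is about.

    import random

    def perturb(sorted_files, p=0.75):
        """Bubblesearch-style perturbation: walk down the sorted list,
        taking each remaining file with probability p, so the packing
        order stays close to sorted but not identical."""
        remaining = list(sorted_files)
        order = []
        while remaining:
            i = 0
            while i < len(remaining) - 1 and random.random() > p:
                i += 1
            order.append(remaining.pop(i))
        return order

    def greedy_pack(files, capacity):
        """First-fit: put each file on the first DVD with enough estimated room."""
        dvds = []                      # each DVD is [bytes_used, [file names]]
        for name, size in files:
            for dvd in dvds:
                if dvd[0] + size <= capacity:
                    dvd[0] += size
                    dvd[1].append(name)
                    break
            else:
                dvds.append([size, [name]])
        return dvds

    def best_packing(files, capacity, tries=10000):
        """Keep the packing that wastes the least space on all DVDs but the last."""
        ranked = sorted(files, key=lambda f: f[1], reverse=True)
        best, best_waste = None, None
        for _ in range(tries):
            dvds = greedy_pack(perturb(ranked), capacity)
            waste = sum(capacity - used for used, _ in dvds[:-1])
            if best is None or waste < best_waste:
                best, best_waste = dvds, waste
        return best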
I have written a program that tries 10,000 perturbations, each of which involves two steps:
- Choose a set of files
- Estimate how much space those files will take on DVD
Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests, but I'd like to do better. I can't afford to use `genisoimage -print-size` because it's too slow. Similarly, I can't tar the files to disk, because not only is it too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question.
Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that `genisoimage` will be fast enough after all, but I doubt it: it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, nor did I when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution, although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.
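For what it's worth, here is the kind of estimator I'm imagining, given the 2KB block size: round each file up to a whole number of 2048-byte blocks and guess at the directory overhead. The directory-record arithmetic below is my reading of plain ISO 9660 (33-byte fixed record plus identifier, records padded to even length, each directory padded to whole blocks); it ignores Rock Ridge, Joliet, and the rule that records don't straddle block boundaries, so treat the constants as assumptions rather than facts.

    import math
    import os

    BLOCK = 2048  # ISO 9660 logical block size

    def estimate_iso_size(paths):
        """Rough ISO 9660 image size: file data rounded up to whole blocks,
        plus an approximate guess at directory and volume overhead."""
        data = 0
        dir_bytes = {}                       # directory -> bytes of its records
        for path in paths:
            size = os.path.getsize(path)
            data += math.ceil(size / BLOCK) * BLOCK
            name = os.path.basename(path)
            ident = len(name) + 2            # plain ISO 9660 appends ";1" to file names
            rec = 33 + ident                 # 33-byte fixed part + identifier
            rec += rec % 2                   # records are padded to even length
            d = os.path.dirname(path)
            dir_bytes[d] = dir_bytes.get(d, 0) + rec
        # Each directory also holds "." and ".." records (34 bytes each) and
        # occupies a whole number of blocks.
        dirs = sum(math.ceil((68 + n) / BLOCK) * BLOCK for n in dir_bytes.values())
        overhead = (16 + 2 + 4) * BLOCK      # system area, volume descriptors,
                                             # path tables -- a rough guess
        return data + dirs + overhead

If I calibrate this against `genisoimage -print-size` on a handful of real directory trees, I should at least find out how far off the overhead guess is, and whether the remaining slop can be a few blocks rather than many megabytes.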
It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with `gzip`.)