I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing. I try 10,000 alternatives and get the best one. Because I don't know how files are stored in an ISO 9660 filesystem, I currently add a lot of slop for metadata. I'd like to cut down the slop.

I could use genisoimage -print-size, except it is too slow: given 40,000 files occupying 500MB, it takes about 3 seconds. At 10,000 trials, that works out to roughly 8 hours per DVD, which is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate or can point me to a helpful specification.


Clarifying the problem and the question:

  • I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.

  • I'm using the standard greedy packing algorithm where you place the largest file first and you put it in the first DVD having sufficient room. So j_random_hacker, I am definitely not starting from random. I start from sorted and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than real capacity.

  • I have written a program that tries 10,000 perturbations, each of which involves two steps:

    1. Choose a set of files
    2. Estimate how much space those files will take on DVD

    Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests, but I'd like to do better. I can't afford to use genisoimage -print-size because it's too slow. Similarly, I can't tar the files to disk, because not only is that too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question. (A rough sketch of the estimate-and-pack loop follows this list.)
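For concreteness, here is a rough sketch of the loop (not my actual program). The estimate_size routine is a crude placeholder (raw bytes plus a flat 5% slop) standing in for whatever ISO 9660 estimate this question produces, and the file sizes are invented for illustration:

#!/usr/bin/perl -w
use strict;
use List::Util qw(sum);

my $CAPACITY = 4_700_000_000;      # nominal DVD capacity, in bytes

# Placeholder estimator: raw bytes plus a flat 5% slop for metadata.
# The whole point of the question is to replace this with something accurate.
sub estimate_size { return 1.05 * sum(@_) }

# Step 2 of each perturbation: greedily fill one DVD, taking files (sizes)
# in the given order and skipping any file that would overflow the estimate.
sub pack_one_dvd {
    my @chosen;
    for my $size (@_) {
        next if estimate_size(@chosen, $size) > $CAPACITY;
        push @chosen, $size;
    }
    return @chosen;
}

my @sizes  = map { 10_000_000 + int rand 500_000_000 } 1 .. 2000;   # invented file sizes
my @sorted = sort { $b <=> $a } @sizes;                             # start from largest-first

my @best;
for (1 .. 10_000) {
    # Step 1, Bubblesearch-style: jitter each file's rank a little and
    # re-sort, giving an order "near" the sorted (greedy) order.
    my @order = map  { $_->[1] }
                sort { $a->[0] <=> $b->[0] }
                map  { [ $_ + 3 * rand(), $sorted[$_] ] } 0 .. $#sorted;
    my @chosen = pack_one_dvd(@order);
    @best = @chosen if sum(@chosen) > (sum(@best) || 0);
}
printf "best packing fills %.1f%% of estimated capacity\n", 100 * sum(@best) / $CAPACITY;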


Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that genisoimage will be fast enough after all, but I doubt it: it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, nor did I when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution, although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.

It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with gzip.)

+1  A: 

Can't you use tar to store the files on disk? It's unclear whether you're writing a program to do this or simply making some backups.

Maybe do some experimentation and err on the side of caution - some free space on a disk wouldn't hurt.

Somehow I imagine you've already considered these, or that my answer is missing the point.

Tyler D
+2  A: 

I'm not sure exactly how you are currently doing this -- according to my googling, "Bubblesearch" refers to a way to choose an ordering of items that is in some sense near a greedy ordering. But in your case, the order of adding files to a DVD does not change the space requirements, so this approach wastes time considering multiple different orders that amount to the same set of files.

In other words, if you are doing something like the following to generate a candidate file list:

  1. Randomly shuffle the list of files.
  2. Starting from the top of the list, greedily choose all files that you estimate will fit on a DVD until no more will.

Then you are searching the solution space inefficiently -- for any final candidate set of n files, you are potentially considering all n! ways of producing that set. My suggestion:

  1. Sort all files in decreasing order of file size.
  2. Mark the top (largest) file as "included," and remove it from the list. (It must be included on some DVD, so we might as well include it now.)
  3. Can the topmost file in the list be included without the (estimated) ISO filesystem size exceeding the DVD capacity? If so:
    • With probability p (e.g. p = 0.5), mark the file as "included".
  4. Remove the topmost file from the list.
  5. If the list is now empty, you have a candidate list of files. Otherwise, goto 3.

Repeat this many times and choose the best file list.
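A sketch of that generator in Perl (with a crude stand-in for the real ISO 9660 size estimate, and invented file sizes purely for illustration) might look like this:

#!/usr/bin/perl -w
use strict;
use List::Util qw(sum);

my $CAPACITY = 4_700_000_000;      # bytes
my $P        = 0.5;                # probability of including a file that fits

# Crude stand-in for the real ISO 9660 size estimate: does this set of
# file sizes fit on one DVD?
sub fits { return 1.05 * sum(@_) <= $CAPACITY }

# Generate one candidate file list (files represented by their sizes).
sub candidate_set {
    my @files    = sort { $b <=> $a } @_;   # step 1: decreasing size
    my @included = shift @files;            # step 2: the largest file always goes in
    while (@files) {                        # steps 3-5: consider each remaining file once
        my $next = shift @files;
        push @included, $next if fits(@included, $next) && rand() < $P;
    }
    return @included;
}

# Repeat many times, keeping the fullest candidate.
my @sizes = map { 10_000_000 + int rand 500_000_000 } 1 .. 2000;    # invented sizes
my @best;
for (1 .. 10_000) {
    my @c = candidate_set(@sizes);
    @best = @c if sum(@c) > (sum(@best) || 0);
}
printf "best candidate fills %.1f%% of capacity\n", 100 * sum(@best) / $CAPACITY;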

Tyler D's suggestion is also good: if you have ~40,000 files totalling ~500MB, that means an average file size of 12.5KB. ISO 9660 uses a block size of 2KB, meaning those files waste on average 1KB of disk space each, or about 8% of their size. So packing them together with tar first will save around 8% of space.

j_random_hacker
@jrh: my algorithm is similar but not identical. If you want to post a question 'when burning files to multiple DVDs, how can I pack each DVD as full as possible', I'll try to give a detailed answer. (Best to email me with the URL of the question.)
Norman Ramsey
+2  A: 

Thanks for the detailed update. I'm satisfied that your current bin-packing strategy is pretty efficient.

As to the question, "Exactly how much overhead does an ISO 9660 filesystem pack on for n files totalling b bytes?" there are only 2 possible answers:

  1. Someone has already written an efficient tool for measuring exactly this. A quick Google search turned up nothing, however, which is discouraging. It's possible someone on SO will respond with a link to their homebuilt tool, but if you get no more responses for a few days then that's probably out too.
  2. You need to read the readily available ISO 9660 specs and build such a tool yourself.

Actually, there is a third answer:

(3) You don't really care about using every last byte on each DVD. In that case, grab a small representative handful of files of different sizes (say 5), pad them till they are multiples of 2048 bytes, and put all 2^5 possible subsets through genisoimage -print-size. Then fit the equation *nx + y = iso_size - total_input_size* on that dataset, where n = number of files in a given run, to find x, which is the number of bytes of overhead per file, and y, which is the constant amount of overhead (the size of an ISO 9660 filesystem containing no files). Round x and y up and use that formula to estimate your ISO filesystem sizes for a given set of files. For safety, make sure you use the longest filenames that appear anywhere in your collection for the test filenames, and put each one under a separate directory hierarchy that is as deep as the deepest hierarchy in your collection.
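Here's a rough sketch of that fit in Perl. The measurements hash is hypothetical (you would fill it from your own genisoimage -print-size runs, converted to bytes if needed), and the numbers below are made up purely for illustration:

#!/usr/bin/perl -w
use strict;
use POSIX qw(ceil);

# Hypothetical measurements: number of files in a test image => measured
# overhead in bytes, i.e. iso_size - total_input_size, where iso_size comes
# from genisoimage -print-size (converted to bytes) and the input files were
# already padded to multiples of 2048.  The numbers below are made up.
my %overhead_bytes = (
    1 => 370_688,
    2 => 372_736,
    3 => 374_784,
    5 => 378_880,
);

# Ordinary least-squares fit of overhead = n*x + y.
my ($k, $sx, $sy, $sxx, $sxy) = (scalar keys %overhead_bytes, 0, 0, 0, 0);
while (my ($n, $o) = each %overhead_bytes) {
    $sx += $n;  $sy += $o;  $sxx += $n * $n;  $sxy += $n * $o;
}
my $x = ($k * $sxy - $sx * $sy) / ($k * $sxx - $sx * $sx);   # overhead bytes per file
my $y = ($sy - $x * $sx) / $k;                               # constant overhead in bytes
($x, $y) = (ceil($x), ceil($y));                             # round up for safety

# Estimate the ISO image size (in bytes) for an arbitrary list of file sizes.
sub estimate_iso_size {
    my $data = 0;
    $data += 2048 * ceil($_ / 2048) for @_;   # each file rounded up to 2KB blocks
    return $data + @_ * $x + $y;
}

print estimate_iso_size(3_000, 1_500_000), "\n";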

j_random_hacker
A: 

Nice thinking, J. Random. Of course I don't need every last byte; this is mostly for fun (and bragging rights at lunch). I want to be able to run du on the CD-ROM and have it come out very close to 4700000000.

I looked at the ECMA spec but like most specs it's medium painful and I have no confidence in my ability to get it right. Also it appears not to discuss Rock Ridge extensions, or if it does, I missed it.

I like your idea #3 and think I will carry it a bit further: I'll try to build a fairly rich model of what's going on and then use genisoimage -print-size on a number of filesets to estimate the parameters of the model. Then I can use the model to do my estimation. This is a hobby project so it will take a while, but I will get around to it eventually. I will post an answer here to say how much wastage is eliminated!

Norman Ramsey
Thanks Norman. I know what you mean, sometimes optimisation is fun just for its own sake :) I realised that there will actually be some overhead in the ISO image even when no files are present, and edited the "model equation" in my 2nd post to reflect that. Let me know how it goes!
j_random_hacker
+1  A: 

I recently ran an experiment to find a formula to do a similar filling estimate on DVDs, and found a simple formula given some assumptions. Going by your original post, this formula will likely give a number that is low for you, since it sounds like you have multiple directories and longer file names.

Assumptions:

  • all the file names are exactly 8.3 characters.
  • all the files are in the root directory.
  • no extensions such as Joliet.

The formula:

174 + floor(count / 42) + sum( ceil(file_size / 2048) )

where:

  • count is the number of files
  • file_size is each file's size in bytes
  • the result is in 2048-byte blocks.

An example script:

#!/usr/bin/perl -w
use strict;
use POSIX;

sub sum {
    my $out = 0;
    for(@_) {
        $out += $_;
    }
    return $out;
}

my @sizes = ( 2048 ) x 1000;    # example input: 1000 files of 2048 bytes each
my $file_count = @sizes;

my $data_size = sum(map { ceil($_ / 2048) } @sizes);   # data blocks, each file rounded up
my $dir_size = floor( $file_count / 42 ) + 1;          # directory blocks: one per 42 files, plus one
my $overhead = 173;                                    # fixed filesystem overhead, in blocks

my $size = $overhead + $dir_size + $data_size;

$\ = "\n";
print $size;

I verified this on disks with up to 150k files, with sizes ranging from 200 bytes to 1 MiB.

Sarah Happy
I want long filenames and Rock Ridge extensions, but +1 for helping out with an old, inactive question!
Norman Ramsey