I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing: I try 10,000 alternatives and keep the best one. Because I don't know how files are stored in an ISO 9660 filesystem, I currently add a lot of slop for metadata. I'd like to cut down the slop.
I could use `genisoimage -print-size`, except it is too slow: given 40,000 files occupying 500MB, a single run takes about 3 seconds, and 10,000 runs works out to roughly 8 hours per DVD, which is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate, or can point me to a helpful specification.
Clarifying the problem and the question:
I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.
I'm using the standard greedy packing algorithm: consider the largest file first, and put each file on the first DVD that has sufficient room. So j_random_hacker, I am definitely not starting from random. I start from a sorted order and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than the real capacity.
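To make the search concrete, here is a minimal Python sketch of the loop. It is not my actual code: the perturbation scheme and the p=0.75 parameter are just placeholders, and `capacity` is whatever estimate of usable DVD space I plug in, which is exactly the part this question is about.

    import random

    def perturb(sorted_files, p=0.75):
        """Bubblesearch-style perturbation: walk down the sorted list,
        taking each remaining file with probability p, so the packing
        order stays close to sorted but not identical."""
        remaining = list(sorted_files)
        order = []
        while remaining:
            i = 0
            while i < len(remaining) - 1 and random.random() > p:
                i += 1
            order.append(remaining.pop(i))
        return order

    def greedy_pack(files, capacity):
        """First-fit: put each file on the first DVD with enough estimated room."""
        dvds = []                      # each DVD is [bytes_used, [file names]]
        for name, size in files:
            for dvd in dvds:
                if dvd[0] + size <= capacity:
                    dvd[0] += size
                    dvd[1].append(name)
                    break
            else:
                dvds.append([size, [name]])
        return dvds

    def best_packing(files, capacity, tries=10000):
        """Keep the packing that wastes the least space on all DVDs but the last."""
        ranked = sorted(files, key=lambda f: f[1], reverse=True)
        best, best_waste = None, None
        for _ in range(tries):
            dvds = greedy_pack(perturb(ranked), capacity)
            waste = sum(capacity - used for used, _ in dvds[:-1])
            if best is None or waste < best_waste:
                best, best_waste = dvds, waste
        return best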
I have written a program that tries 10,000 perturbations, each of which involves two steps:
- Choose a set of files
- Estimate how much space those files will take on DVD
Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests, but I'd like to do better. I can't afford to use `genisoimage -print-size` because it's too slow. Similarly, I can't tar the files to disk, because not only is it too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question.
Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that `genisoimage` will be fast enough after all, but I doubt it: it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, nor did I when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution, although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.
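For what it's worth, here is the kind of estimator I'm imagining, given the 2KB block size: round each file up to a whole number of 2048-byte blocks and guess at the directory overhead. The directory-record arithmetic below is my reading of plain ISO 9660 (33-byte fixed record plus identifier, records padded to even length, each directory padded to whole blocks); it ignores Rock Ridge, Joliet, and the rule that records don't straddle block boundaries, so treat the constants as assumptions rather than facts.

    import math
    import os

    BLOCK = 2048  # ISO 9660 logical block size

    def estimate_iso_size(paths):
        """Rough ISO 9660 image size: file data rounded up to whole blocks,
        plus an approximate guess at directory and volume overhead."""
        data = 0
        dir_bytes = {}                       # directory -> bytes of its records
        for path in paths:
            size = os.path.getsize(path)
            data += math.ceil(size / BLOCK) * BLOCK
            name = os.path.basename(path)
            ident = len(name) + 2            # plain ISO 9660 appends ";1" to file names
            rec = 33 + ident                 # 33-byte fixed part + identifier
            rec += rec % 2                   # records are padded to even length
            d = os.path.dirname(path)
            dir_bytes[d] = dir_bytes.get(d, 0) + rec
        # Each directory also holds "." and ".." records (34 bytes each) and
        # occupies a whole number of blocks.
        dirs = sum(math.ceil((68 + n) / BLOCK) * BLOCK for n in dir_bytes.values())
        overhead = (16 + 2 + 4) * BLOCK      # system area, volume descriptors,
                                             # path tables -- a rough guess
        return data + dirs + overhead

If I calibrate this against `genisoimage -print-size` on a handful of real directory trees, I should at least find out how far off the overhead guess is, and whether the remaining slop can be a few blocks rather than many megabytes.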
It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with `gzip`.)