Hello,

This is a problem I would think there is already an algorithm for - but it seems I don't know the right words to search for on Google :).

The problem: I would like to make a little program in which I would select a directory containing any files (but for my purpose media files, audio and video). After that I would enter the maximum total file size, in MB, that must not be exceeded. At this point you would hit a "Calculate best fit" button.

This button should compare all the files in the directory and produce a list of the files that, when put together, get as close as possible to the maximum total file size without going over the limit.

This way you could find out which files to combine when burning a CD or DVD so that you use as much of the disc as possible.

I've tried to come up with the algorithm for this myself - but failed :(.

Anyone know of some nice algorithm for doing this?

Thanks in advance :)

+3  A: 

Sounds like you have a hard problem there. This problem is well known, but no efficient solutions exist (or, perhaps, can exist).

lbruder
efficient *approximate* solutions do exist.
Alexandre C.
Reasonably efficient exact solutions exist. The dynamic programming solution is pseudo-polynomial. And fortunately, the size of a DVD is constant, at least until you get a Blu-Ray-W drive. So I'd at least give the DP solution a go. It might fail for large directories, true, but I don't actually know whether "large" is more or less than, say, 10000 files. Tip: round all the files sizes to the DVD filesystem block size first: it will considerably speed up the DP solution and give *more* accurate results.
Steve Jessop
(1 hour later) I now know that "large" is less than 10000 files, having given the DP solution a go. But it's not an outrageously hard problem, it's in that annoying about-an-order-of-magnitude-unless-I-can-think-of-something-clever zone.
Steve Jessop
A: 

Other than the obvious way of trying all permutations of objects with size < bucket, you could also have a look at the implementation of the Bucketizer Perl module, which does exactly what you are asking for. I'm not sure exactly what it does, but the manual mentions that there is one "brute force" way, so I'm assuming there must also be some kind of optimization.

Adrian Grigore
A: 

Thank you for your answers.

I have now looked into this problem more with the guidance of the given answers. Among other things I found this webpage: http://www.mathmaniacs.org/lessons/C-subsetsum/index.html. It describes the subset sum problem, which I believe is the problem I described here.

One sentence from the webpage is this:

--

You may want to point out that a number like 2^300 is so large that even a computer counting at a speed of over a million or billion each second would not reach 2^300 until long after our sun had burned out.

--

Personally I would have more use for this algorithm when comparing a larger number of files than, say, 10 or fewer, since it is fairly easy to reach what is probably the biggest sum just by manual trial and error when the number of files is low.

A CD of MP3s can easily hold 100 files, and a DVD a lot more, which leads to the sun burning out before I have the answer :).

Randomly trying to find the optimum sum can apparently get you pretty close, but it can never be guaranteed to be the optimum answer and, with bad luck, can also be quite far away. Brute force seems to be the only real way to get the optimum answer, and that would take way too long.

So I guess I just continue estimating manually a good combination of files to burn on CDs and DVDs. :)

Thanks for the help. :)

CKa
@CKa: no, you're being unduly pessimistic. I have just worked up some code in Python, completely unoptimised. Assuming a block size of 2kb, it will solve the decision version of the problem for 100 files (random sizes between 2k-6k blocks) and a disk of size 40000 blocks (that is, 80MB), in 100 seconds. Obviously that's not yet up to your problem, but it is at least in the ballpark, and the sun hasn't burnt out yet that I've noticed ;-). Despite what you've read, the accurate solution is actually O(n*m), where n is the number of files and m is the size of the DVD. It is *not* exponential.
Steve Jessop
+5  A: 

This is, as others pointed out, the Knapsack Problem, which is a combinatorial optimization problem. This means you are looking for a subset or permutation of a set which minimizes (or maximizes) a certain cost. Another well known such problem is the Traveling Salesman Problem.

Such problems are usually very hard to solve. But if you're interested in almost optimal solutions, you can use non-deterministic algorithms, like simulated annealing. You most likely won't get the optimal solution, but a nearly optimal one.

This link explains how simulated annealing can solve the Knapsack Problem, and therefore should be interesting to you.
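For concreteness, here is a minimal sketch (mine, not taken from the linked page) of how simulated annealing could be applied to the poster's variant, where the "value" of a file is just its size and the capacity is the disc size. The function name and parameters are illustrative, and scoring infeasible states as -1 is only one of many possible choices.

import math, random

def anneal(sizes, capacity, steps=100000, start_temp=1000.0, cooling=0.9999):
    # chosen[i] says whether file i is currently in the knapsack
    def total(chosen):
        return sum(s for s, c in zip(sizes, chosen) if c)

    def score(chosen):
        t = total(chosen)
        return t if t <= capacity else -1   # infeasible states score worst

    current = [False] * len(sizes)
    best = list(current)
    temp = start_temp
    for _ in range(steps):
        candidate = list(current)
        i = random.randrange(len(sizes))
        candidate[i] = not candidate[i]     # flip one file in or out
        delta = score(candidate) - score(current)
        # always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = candidate
            if score(current) > score(best):
                best = list(current)
        temp *= cooling
    return [s for s, c in zip(sizes, best) if c]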

Alexandre C.
All true save one thing - you *could* get an optimal solution, there is just no *guarantee* that you will - it depends on the size of the search space and how lucky you are - certainly unlikely, but so is winning the lottery, and people do win...
Mark Mullin
@Mark: Actually for the Traveling Salesman Problem, the optimality of the solution found by non-deterministic methods can be quantified. You're guaranteed to have a solution within XXX of the optimal one (in terms of cost) after NNN steps with probability YYY, and there is a formula relating XXX, YYY and NNN. It is very good in practice.
Alexandre C.
@Alexandre C:true - you are guaranteed that the solution will be within XXX of optimal - my observation was only that you could get lucky and hit the optimal solution - I was nitpicking your original post when you said you "won't" get the optimal solution - you might, but I certainly wouldn't bet on it :-)
Mark Mullin
@Mark: Okay, edited.
Alexandre C.
+4  A: 

Just for fun I tried out the accurate dynamic programming solution. Written in Python, because of my supreme confidence that you shouldn't optimise until you have to ;-)

This could provide either a start, or else a rough idea of how close you can get before resorting to approximation.

Code based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem, hence the less-than-informative variable names m, W, w, v.

#!/usr/bin/python
# 0/1 knapsack by memoised recursion, following the Wikipedia formulation.
# Here "value" and "weight" are both just the file size, in blocks.

import sys

solcount = 0    # counts how many Solution objects get built, out of curiosity

class Solution(object):
    def __init__(self, items):
        object.__init__(self)
        #self.items = items
        self.value = sum(items)
        global solcount
        solcount += 1
    def __str__(self):
        #return str(self.items) + ' = ' + str(self.value)
        return ' = ' + str(self.value)

# memo table: m[(number of items still available, remaining capacity)] -> Solution
m = {}

def compute(v, w):
    # best total size achievable from the sizes in v with capacity w
    coord = (len(v),w)
    if coord in m:
        return m[coord]
    if len(v) == 0 or w == 0:
        m[coord] = Solution([])
        return m[coord]
    newvalue = v[0]
    newarray = v[1:]
    # best solution that leaves the first item out
    notused = compute(newarray, w)
    if newvalue > w:
        # the item doesn't fit at all, so leaving it out is the only option
        m[coord] = notused
        return notused
    # best solution that puts the first item in
    # used = Solution(compute(newarray, w - newvalue).items + [newvalue])
    used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
    best = notused if notused.value >= used.value else used
    m[coord] = best
    return best

def main():
    v = [int(l) for l in open('filesizes.txt')]   # one size per line, in blocks
    W = int(sys.argv[1])                          # capacity, in blocks
    print len(v), "items, limit is", W
    print compute(v, W)
    print solcount, "solutions computed"

if __name__ == '__main__':
    main()

For simplicity I'm just considering the file sizes: once you have the list of sizes that you want to use, you can find some filenames with those sizes by searching through a list, so there's no point tangling up filenames in the core, slow part of the program. I'm also expressing everything in multiples of the block size.

As you can see, I've commented out the code that gives the actual solution (as opposed to the value of the solution). That was to save memory - the proper way to store the list of files used isn't one list in each Solution, it's to have each solution point back to the Solution it was derived from. You can then calculate the list of filesizes at the end by going back through the chain, outputting the difference between the values at each step.
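For illustration, here is one way the parent-pointer idea might look. It is not a drop-in for the code above: compute would construct the "used" case as Solution(sub.value + newvalue, parent=sub) instead of building a list, and the "notused" case would keep reusing the same object, so the chain records exactly the included file sizes. The names are mine.

class Solution(object):
    # store the value plus a link to the solution this one was built from,
    # instead of a full list of items
    def __init__(self, value=0, parent=None):
        self.value = value
        self.parent = parent

def used_sizes(sol):
    # walk back through the chain; each step's size is the value difference
    sizes = []
    while sol.parent is not None:
        sizes.append(sol.value - sol.parent.value)
        sol = sol.parent
    return sizes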

With a list of 100 randomly-generated file sizes in the range 2000-6000 (I'm assuming 2k blocks, so that's files of size 4-12MB), this solves for W=40K in 100 seconds on my laptop. In doing so it computes 2.6M of a possible 4M solutions.

Complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete: the running time is pseudo-polynomial, i.e. polynomial in the numeric value of W rather than in the number of bits needed to write W down. So I am at least approaching a solution, and this is just in unoptimised Python.

Clearly some optimisation is now required, because actually it needs to be solved for W=4M (8GB DVD) and however many files you have (let's say a few thousand). Presuming that the program is allowed to take 15 minutes (comparable to the time required to write a DVD), that means performance is currently short by a factor of roughly 10^3. So we have a problem that's quite hard to solve quickly and accurately on a PC, but not beyond the bounds of technology.

Memory use is the main concern, since once we start hitting swap we'll slow down, and if we run out of virtual address space we're in real trouble because we have to implement our own storage of Solutions on disk. My test run peaks at 600MB. If you wrote the code in C on a 32-bit machine, each "solution" would have a fixed size of 8 bytes. You could therefore generate a massive 2-D array of them without doing any memory allocation in the loop, but in 2GB of RAM you could only handle W=4M and n=67. Oops - DVDs are out. It could very nearly solve for CDs with a 2k block size, though: W=350k gives n=766.

Edit: MAK's suggestion to compute iteratively bottom-up, rather than recursively top-down, should massively reduce the memory requirement. First calculate m(1,w) for all 0 <= w <= W. From this array, you can calculate m(2,w) for all 0 <= w <= W. Then you can throw away all the m(1,w) values: you won't need them to calculate m(3,w) etc.
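A rough sketch of that bottom-up version, in the same style as the script above (the function name is mine). It keeps a single array of W+1 entries instead of the memo dict, at the cost of only returning the best total rather than the file list:

#!/usr/bin/python
# Bottom-up 0/1 knapsack using O(W) memory.  best[w] is the largest total
# achievable from the sizes processed so far without exceeding w.

import sys

def best_total(sizes, W):
    best = [0] * (W + 1)
    for size in sizes:
        # iterate w downwards so each file is used at most once
        for w in range(W, size - 1, -1):
            candidate = best[w - size] + size
            if candidate > best[w]:
                best[w] = candidate
    return best[W]

def main():
    sizes = [int(l) for l in open('filesizes.txt')]
    W = int(sys.argv[1])
    print len(sizes), "items, limit is", W
    print "best total =", best_total(sizes, W)

if __name__ == '__main__':
    main()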

By the way, I suspect that the problem you actually want to solve might be the bin packing problem, rather than just the question of how to get as close as possible to filling a DVD. That is: you have a bunch of files and you want to write them all to DVD, using as few DVDs as possible. There are situations where solving the bin packing problem is very easy, but solving this problem is hard. For example, suppose that you have 8GB disks and 15GB of small files. It's going to take some searching to find the closest possible match to 8GB, but the bin-packing problem would be trivially solved just by putting roughly half the files on each disk - it doesn't matter exactly how you divide them, because you're going to waste 1GB of space whatever you do.

All that said, there are extremely fast heuristics that give decent results much of the time. Simplest is to go through the list of files (perhaps in decreasing order of size), and include each file if it fits, exclude it otherwise. You only need to fall back to anything slow if fast approximate solutions aren't "good enough", for your choice of "enough".
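For completeness, that greedy pass is only a few lines (the name greedy_fit is mine):

def greedy_fit(sizes, capacity):
    # take files largest first, keeping each one that still fits
    chosen, total = [], 0
    for size in sorted(sizes, reverse=True):
        if total + size <= capacity:
            chosen.append(size)
            total += size
    return chosen, total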

Steve Jessop
+1. Possible optimizations (for the OP): use a bottom up DP instead of recursion, or at least replace the `dict` with a 2D array or `list` of `list`s.
MAK
And another optimisation - don't create a new list `newarray`, use the same list but pass an index around. Together with a 2-D array for `m`, and assuming a switch to a language such as C, that eliminates the last memory allocation from `compute`.
Steve Jessop
@MAK: I'm not *too* worried about the recursion, since it only goes `n` deep. For n >= 1000, that means calling `sys.setrecursionlimit` in Python.
Steve Jessop
In dynamic programming lingo, what are the "stages" and what is the "state variable" at each stage? This would help me to understand your solution.
Cary Swoveland
@Cary: follow the Wikipedia link. It defines `m(i,w)` as the greatest value obtainable by selecting from the first `i` items (I've actually used the last `i`) and the selection having total weight <= w (and of course in this problem value = weight).
Steve Jessop
@Steve: Thanks. I was thinking of a multiple knapsack problem, assuming the poster wanted to load up many (but as few as possible)disks.
Cary Swoveland
You're probably right, but the questioner did *ask* how to use as much as possible of the disk. If they want bin-packing too, they can ask for that too. Maybe if you had a continual stream of incoming files, that long-term you wanted to fit on as few disks as possible, you would let your buffer (HD) fill until either you can create a precisely full disk (in which case burn it), or else your buffer is full (in which case burn the best you can). Then you would want a solution to knapsack even though long-term you're in effect doing bin packing.
Steve Jessop
A: 

If you're looking for a reasonable heuristic, and the objective is to minimize the number of disks required, here's a simple one you might consider. It's similar to one I used recently for a job-shop problem. I was able to compare it to known optima, and found it provided allocations that were either optimal or extremely close to being optimal.

Suppose B is the size of all files combined and C is the capacity of each disk. Then you will need at least n = roundup(B/C) disks. Try to fit all the files on n disks. If you are able to do so, you're finished, and have an optimal solution. Otherwise, try to fit all the files on n+1 disks. If you are able to do so, you have a heuristic solution; otherwise try to fit the files on n+2 disks, and so on, until you are able to do so.

For any given allocation of files to the disks below (which may exceed some disk capacities), let s_i be the combined size of the files allocated to disk i, and let t = max_i s_i. We are finished when t <= C.

First, order (and index) the files largest to smallest.

For m >= n disks,

  1. Allocate the files to the disks in a back-and-forth way: 1->1, 2->2, ..., m->m, m+1->m-1, m+2->m-2, ..., 2m->1, 2m+1->2, 2m+2->3, ..., 3m->m, 3m+1->m-1, and so on until all files are allocated, with no regard to disk capacity. If t <= C we are finished (and the allocation is optimal if m = n); otherwise go to #2.

  2. Attempt to reduce t by moving a file from a disk i with si = t to another disk, without increasing t. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further, in which case go to #3.

  3. Attempt to reduce t by performing pairwise exchanges between disks. Continue doing this until t <= C, in which case we are finished (and the allocation is optimal if m = n), or t cannot be reduced further with pairwise exchanges. In the latter case, repeat #2, unless no improvement was made the last time #2 was repeated, in which case increment m by one and repeat #1.

In #2 and #3 there are of course different ways to order possible reallocations and pairwise exchanges.
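To make this a little more concrete, here is an untested sketch of steps #1 and #2. The function names, and my reading of the back-and-forth dealing order, are my own; step #3 (pairwise exchanges) is left out.

def snake_allocate(sizes, m):
    # Step #1: deal the sizes, largest first, over disks 1..m, m-1..2, 1..m, ...
    disks = [[] for _ in range(m)]
    order = list(range(m)) + list(range(m - 2, 0, -1))
    for k, size in enumerate(sorted(sizes, reverse=True)):
        disks[order[k % len(order)]].append(size)
    return disks

def improve_by_moves(disks, capacity):
    # Step #2: move single files off a fullest disk while that reduces t
    while True:
        totals = [sum(d) for d in disks]
        t = max(totals)
        if t <= capacity:
            return disks, True      # every disk fits
        src = totals.index(t)
        moved = False
        for size in sorted(disks[src]):
            for dst in range(len(disks)):
                if dst != src and totals[dst] + size < t:
                    disks[src].remove(size)
                    disks[dst].append(size)
                    moved = True
                    break
            if moved:
                break
        if not moved:
            return disks, False     # stuck: this is where step #3 would start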

Cary Swoveland