views: 45
answers: 1

Suppose I have a set of objects, S. There is an algorithm f that, given a set S, builds a certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting into the allotted memory). To overcome this, I split S into several non-intersecting subsets: S = S1 + S2 + ... + Sn and build a Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into the memory constraints. Since the size of f(S) grows faster than S itself, the combined size of the Di is much less than the size of D.

However, it is still desirable to reduce n, i.e. the number of subsets, or to reduce the combined size of the Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if the input objects are "similar enough" to each other.

The problem is that while the "similarity" of objects in S and the size of f(S) do correlate, there is no way to compute the latter other than just evaluating f(S), and f is not particularly fast.

The algorithm I currently have is to iteratively add each next object from S into one of the Si, choosing the one that results in the least possible (at this stage) increase in the combined Di size:

for x in S:
    # subsets[i] plays the role of Si; pick the one whose Di grows the least when x is added
    i = min(range(len(subsets)),
            key=lambda j: size(f(subsets[j] | {x})) - size(f(subsets[j])))
    subsets[i].add(x)

This gives practically useful results, but they are certainly pretty far from the optimum (i.e. the minimal possible combined size). It is also slow. To speed things up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to the objects already in Si.
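For illustration, a minimal sketch of that filtered greedy step, assuming a hypothetical cheap predicate similar_enough(x, subset) and the f/size functions from above:

def place(x, subsets, f, size, similar_enough):
    # evaluate the expensive f only on subsets that pass the cheap similarity filter
    candidates = [i for i, s in enumerate(subsets) if similar_enough(x, s)]
    if not candidates:
        candidates = list(range(len(subsets)))  # nothing similar: fall back to all subsets
    growth = lambda i: size(f(subsets[i] | {x})) - size(f(subsets[i]))
    best = min(candidates, key=growth)
    subsets[best].add(x)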

Is there any standard approach to such kinds of problems?

I know of the branch and bound family of algorithms, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute the optimal distribution of S into the Si in reasonable time. But is there some common iteratively improving algorithm?

EDIT:

As the comments noted, I never defined "similarity". In fact, all I want is to split S into subsets Si such that the combined size of the Di = f(Si) is minimal, or at least small enough. "Similarity" is defined only as this and unfortunately simply cannot be computed easily. I do have a simple approximation, but it is only that: an approximation.

So, what I need is a (likely heuristic) algorithm that minimizes sum f(Si), given that there is no simple way to compute the latter; I only have approximations, which I use to throw away cases that are very unlikely to give good results.

+1  A: 

Regarding the slowness: I have found that in similar problems a good-enough solution is to compute the match by picking only a fixed number of random candidates.

True, the result will not be the best one (often worse than the full "greedy" solution you implemented), but in my experience it is not too bad, and you can decide the speed... it can even be run within a prescribed amount of time (that is, you keep searching until the allocated time expires).

Another option I use is to keep searching until I see no improvement for a while.
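A rough sketch of both ideas (random candidate subsets plus a time budget with an early stop), assuming the same f/size as in the question; sample_size, time_budget and improve_step are hypothetical names:

import random
import time

def place_sampled(x, subsets, f, size, sample_size=10):
    # check only a fixed number of randomly chosen subsets instead of all of them
    candidates = random.sample(range(len(subsets)), min(sample_size, len(subsets)))
    growth = lambda i: size(f(subsets[i] | {x})) - size(f(subsets[i]))
    best = min(candidates, key=growth)
    subsets[best].add(x)

def improve(improve_step, time_budget, patience=5):
    # improve_step() is any move that returns True if it reduced the combined size;
    # keep calling it until the time budget expires or several steps in a row bring no gain
    deadline = time.time() + time_budget
    stale = 0
    while time.time() < deadline and stale < patience:
        stale = 0 if improve_step() else stale + 1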

To get past the greedy logic you could keep a queue of N "x" elements and try to pack them simultaneously in groups of "k" (with k < N). In this case I found that it is important to also keep the "age" of an element in the queue and to use it as a "prize" in the scoring, to avoid keeping "bad" elements forever in the queue because others will always match better (that would make the queue search useless and the results would be basically the same as the greedy approach).
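One possible sketch of that queue, under my own assumptions: it places one queued element per call, and calling it repeatedly while refilling the queue approximates packing in groups. The [age, x] entry layout and age_bonus are hypothetical names; f/size are as in the question:

def place_from_queue(queue, subsets, f, size, age_bonus=1.0):
    # queue is a list of [age, x] pairs; pick the cheapest (element, subset) placement,
    # discounting the cost of elements that have already waited for a long time
    def score(entry, j):
        age, x = entry
        return size(f(subsets[j] | {x})) - size(f(subsets[j])) - age * age_bonus
    entry, i = min(((e, j) for e in queue for j in range(len(subsets))),
                   key=lambda p: score(*p))
    subsets[i].add(entry[1])
    queue.remove(entry)
    for e in queue:
        e[0] += 1  # everything left in the queue gets older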

6502
I'm not sure anything will work, but there are certainly some good ideas here worth trying.
doublep
Also note that my solution is not "full". It selects the best possible candidate at each iteration, yes, but that is not the same as the best partitioning overall.
doublep
Choosing the single most promising move is what is conventionally called the "greedy" approach... I called your solution full-greedy because you said you are trying all moves. When there are many possible moves, I have simply found that in several problems a quasi-greedy approach that only checks a random subset of the possible moves is not much worse in value but can be a lot faster.
6502
@6502: Ah, I see. However, I don't try all possible moves. As noted in the question, I only try those "where `x` is "similar enough" to objects already in `Si`". I.e. I discard cases that are very unlikely to give any good results according to the heuristic.
doublep