I'm looking for an algorithm to segment a sequence of positive numbers into n subsequences, such that the standard deviation of the sums of the subsequences is minimized.

The ordering of the numbers in each subsequence must be the same as the ordering in the original sequence.

For example:

Suppose I have the sequence {1,1,1,1,1,1,10,1} and I want to segment it into 2 subsequences.
I believe the optimal solution would be {1,1,1,1,1,1}, {10,1}.

The sum of the 1st subsequence is 6, and the sum of the 2nd subsequence is 11.
The standard deviation of those two numbers is ~3.5, which I believe is the lowest possible.

Suppose I have the sequence {4,1,1,1,1,6} and I want to segment it into 3 subsequences.
I believe the optimal solution would be {4}, {1,1,1,1}, {6}.
The sums of the subsequences are 4, 4, and 6.
The standard deviation of those 3 numbers is ~1.15, which I believe is the lowest possible.

The best algorithm I was able to come up with is to compute the cumulative sums of the numbers in the sequence, and segment the sequence each time the cumulative sum crosses a multiple of [totalSum/numSubsequences].

For example, given the sequence {4,1,1,1,1,6}, the cumulative sums are {4,5,6,7,8,14}. The total of all the numbers in the sequence is 14, so, given that I want 3 subsequences, I should segment the sequence when the running total reaches 14/3 = 4.66 and 2 * 14/3 = 9.33.

However, there is no actual place in the sequence where the cumulative total equals 4.66 - the first cumulative total is 4, and the next is 5. So should I round up or round down? In this case, rounding down to 4 gives the optimal solution, but that isn't always the case. The best I can think of is to try every combination of rounding up and down, but that results in O(2^numSubsequences) complexity.
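For concreteness, here is one way that heuristic could be sketched in Python, always taking the round-down choice (the function names are illustrative, and `stdev_of_sums` uses the sample standard deviation to match the ~3.5 and ~1.15 figures above):

```python
import math

def greedy_split(seq, n):
    """Greedy cumulative-sum heuristic: cut whenever the running total
    would pass the next multiple of total/n, always rounding down so
    the boundary element starts the next subsequence."""
    total = sum(seq)
    target = total / n
    parts, current, running, boundary = [], [], 0, target
    for x in seq:
        if len(parts) < n - 1 and running + x > boundary:
            parts.append(current)
            current = []
            boundary += target
        current.append(x)
        running += x
    parts.append(current)
    return parts

def stdev_of_sums(parts):
    """Sample standard deviation of the subsequence sums."""
    sums = [sum(p) for p in parts]
    mean = sum(sums) / len(sums)
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / (len(sums) - 1))
```

On both examples above this happens to reproduce the optimal splits, but as noted, rounding down is not always the right choice.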

This seems to be the type of thing that would have a preexisting algorithm to apply, however my Googling has failed me. I am aware of the Partition Problem, which is NP-complete, but that deals with unordered sets, and not ordered sequences.

Any help would be appreciated.

+1  A: 

One idea that comes to my mind is to use the A* search algorithm.

More about that:

http://en.wikipedia.org/wiki/A*_search_algorithm

Good book to read about that:

Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig

Some things you could use for the A*:

  • Initial state: split the sequence into n subsequences of equal length (as nearly as possible).
  • Successor states: for each subsequence, extend it to the left or to the right (take the last number of subsequence i-1, if i != 0, or the first number of subsequence i+1, if i != n) to generate all child nodes of the current state.
  • Heuristic: the optimal value of each sum would be the mean of all values. It's admissible, so it can be used with A*.

I'm not sure this will really solve your problem, since I haven't solved this exact problem myself, but I think it might do pretty well. It may not be the most sophisticated solution for this specific problem, but it is surely better than any "try all combinations" approach. It is also sound and complete (because of the admissible heuristic).
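As a rough illustration, an A*-style search for this problem might be sketched as follows in Python. Note this is my own formulation, not necessarily the exact state space described above: it builds subsequences left to right, and the heuristic is a lower bound derived from the ideal case where all remaining sums equal the mean:

```python
import heapq

def astar_split(seq, n):
    """A*-style search sketch. A state is (i, j): the first i elements
    consumed, j subsequences completed. g = sum of squared subsequence
    sums so far; h = R^2 / (n - j), where R is the remaining sum -- a
    lower bound on the remaining cost (equal sums would be ideal), hence
    admissible. Since the mean of the sums is fixed at total/n,
    minimizing the sum of squares minimizes the standard deviation."""
    L = len(seq)
    prefix = [0] * (L + 1)
    for i, x in enumerate(seq):
        prefix[i + 1] = prefix[i] + x
    total = prefix[L]

    def h(i, j):
        rem = total - prefix[i]
        return rem * rem / (n - j) if j < n else 0.0

    # heap entries: (g + h, g, i, j, end indices of completed subsequences)
    heap = [(h(0, 0), 0.0, 0, 0, ())]
    seen = {}
    while heap:
        f, g, i, j, ends = heapq.heappop(heap)
        if j == n:
            if i == L:  # goal: everything consumed, n subsequences formed
                bounds = (0,) + ends
                return [list(seq[a:b]) for a, b in zip(bounds, bounds[1:])]
            continue  # dead end: used all n subsequences too early
        if seen.get((i, j), float("inf")) <= g:
            continue
        seen[(i, j)] = g
        # next subsequence is seq[i:k]; leave room for the remaining ones
        for k in range(i + 1, L - (n - j - 1) + 1):
            s = prefix[k] - prefix[i]
            g2 = g + s * s
            heapq.heappush(heap, (g2 + h(k, j + 1), g2, k, j + 1, ends + (k,)))
    return None
```

Because the heuristic is admissible (and consistent), the first goal state popped is an optimal split.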

If you have more questions about this, ask and I will try my best to help you.

George B.
+1  A: 

I think you mean divide into contiguous chunks, or in other words find n-1 places at which to cut the sequence into pieces. (If you really mean to allow subsequences that interleave to create the main sequence, you could probably just sort the sequence, solve the chunk problem, and then track where the individual numbers came from to provide interleaved subsequences).

I think you can solve this in time proportional to n times the sequence length using dynamic programming. Work from left to right, filling in arrays bestCost[i][j] and lastCut[i][j], where i runs along the sequence and j runs from 0 to n-1. bestCost[i][j] is the cost of the best way of cutting the sequence from 0 to i into j chunks, and lastCut[i][j] is the position of the most recent cut in the solution that produces bestCost[i][j]. The recurrence is bestCost[i + 1][j] = min_k (std deviation(i + 1 to k) + bestCost[k - 1][j - 1]), with lastCut[i + 1][j] set to the minimizing k. At the end, work out the cost of the best answer for n cuts in the same way, and then use lastCut[][] to trace your way back to the other cuts.

mcdowella
Your solution is good, except for a slightly different problem. The OP wants the standard deviation of the sums of the subsequences, not the sum of the standard deviations of the subsequences. Also, this solution is `O(length^2 * #subsequences)` not `O(length * #subsequences)` because computing that minimum in `bestCost[i+1][j]` takes `O(length)` time. Indeed, you have to run over that many different values of `k`. (Incidentally, I started writing my answer before you posted yours. It's no coincidence they're similar because dynamic programming is the way to go here.)
A. Rex
+7  A: 

Suppose the length of the original sequence is L and the number of subsequences is N.

You may simplify the expression for the standard deviation to get sqrt(E[X^2] - E[X]^2), where E denotes expectation/average and X denotes your random variable -- in your case, the sum of a subsequence. (A similar formula applies to the "sample standard deviation".) Note that E[X] does not depend on how you split your sequence, because it is always the total sum divided by N. Thus, we just want to minimize E[X^2] or, equivalently, the sum of the squared subsequence sums (they differ by a factor of N, by the definition of average).
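To see concretely that the two objectives have the same minimizer, here is a small brute-force check in Python over every contiguous split of the first example (helper names are illustrative):

```python
from itertools import combinations
from statistics import pstdev

def all_splits(seq, n):
    """Every way to cut seq into n contiguous non-empty chunks."""
    L = len(seq)
    for cuts in combinations(range(1, L), n - 1):
        bounds = (0,) + cuts + (L,)
        yield [seq[a:b] for a, b in zip(bounds, bounds[1:])]

seq, n = [1, 1, 1, 1, 1, 1, 10, 1], 2
best_by_std = min(all_splits(seq, n), key=lambda p: pstdev(map(sum, p)))
best_by_ssq = min(all_splits(seq, n), key=lambda p: sum(sum(c) ** 2 for c in p))
assert best_by_std == best_by_ssq  # the minimizers coincide
```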

At this point, we can see that this problem can be solved with dynamic programming. Let f(i,j), for i from 0 to L and j from 1 to N, be the minimal sum of squares of the subsequence sums over splits of the first i elements of your sequence into j subsequences. Then f(i,j) may be computed in terms of the f(i',j') with i' <= i and j' < j. More specifically, if your sequence is a[k], indexed from 0 to L-1:

f(i,1) = sum( a[k] for 0 <= k < i )^2
f(i,j) = minimum of  f(l,j-1) + sum( a[k] for l <= k < i )^2  for l from 0 to i

Having computed f(L,N), you can use standard dynamic programming techniques to recover the splits. In particular, you can store the l that minimizes f(i,j).

The runtime of this solution is O(L^2 N) because you compute O(L N) different values of f and the minimum is over O(L) different values of l.

Here's a straightforward implementation in Perl:

#!/usr/bin/perl

use strict;
use warnings;

local $\ = $/;
print join ", ", map {"@$_"} best( 2, qw(1 1 1 1 1 1 10 1) );
# prints "1 1 1 1 1 1, 10 1"

print join ", ", map {"@$_"} best( 3, qw(4 1 1 1 1 6) );
# prints "4, 1 1 1 1, 6"

sub best {
    my( $N, @a ) = @_;

    my( @f, @g, $i, $j, $k, $sum );

    # DP base case
    $sum = 0;
    $f[0][1] = $g[0][1] = 0;
    for $i ( 1 .. @a ) {
        $sum += $a[$i-1];
        $f[$i][1] = $sum * $sum;
        $g[$i][1] = 0;
    }

    # DP recurrence
    for $j ( 2 .. $N ) {
        $f[0][$j] = $g[0][$j] = 0;
        for $i ( 1 .. @a ) {
            $sum = 0;
            $f[$i][$j] = $f[$i][$j-1];
            $g[$i][$j] = $i;
            for $k ( reverse 0 .. $i-1 ) {
                $sum += $a[$k];
                if( $f[$i][$j] > $f[$k][$j-1] + $sum * $sum ) {
                    $f[$i][$j] = $f[$k][$j-1] + $sum * $sum;
                    $g[$i][$j] = $k;
                }
            }
        }
    }

    # Extract best expansion
    my( @result );
    $i = @a; $j = $N;

    while( $j ) {
        $k = $g[$i][$j];
        unshift @result, [@a[$k .. $i-1]];
        $i = $k;
        $j--;
    }

    return @result;
}
A. Rex
Good answer! I was trying to turn this into a DP problem but you got it first :P
Charles Ma
+1 I'm struggling to learn about dynamic programming and am using your answer as a case study. Any additional clarifications you can add would be appreciated. In particular -- and I'm not certain this is a good question -- is there an intuitive explanation for the values in the `@f` matrix after the "DP recurrence" section of the code has finished? Thanks.
FM
$f[$i][$j] contains the answer to a subproblem: the smallest possible sum of squares for the first $i entries from your list, assuming you divide into $j parts. Then $g[$i][$j] contains the location of the start of the last subdivision. The key insight that makes dynamic programming work is that if you take an optimal solution to this problem and take away the last (or first) group of numbers, you have an optimal solution to a smaller problem. Thus, you can build up a solution by solving subproblems. It's also equivalent to recursion with memoization, so if you get that, you've almost got it.
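For what it's worth, here is a rough Python sketch of that recursion-with-memoization view (the names are my own, not taken from the Perl above): f(i, j) returns both the minimal sum of squares for the first i elements in j chunks and the cut point that achieves it.

```python
from functools import lru_cache

def best_split(seq, n):
    """Memoized-recursion version of the DP: f(i, j) = (minimal sum of
    squared chunk sums for the first i elements split into j chunks,
    start index of the last chunk in that optimum)."""
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    @lru_cache(maxsize=None)
    def f(i, j):
        if j == 1:
            return prefix[i] ** 2, 0
        # last chunk is seq[k:i]; k >= j-1 leaves room for j-1 earlier chunks
        return min(
            (f(k, j - 1)[0] + (prefix[i] - prefix[k]) ** 2, k)
            for k in range(j - 1, i)
        )

    # walk the stored cut points back to recover the chunks
    i, j, cuts = len(seq), n, []
    while j > 1:
        k = f(i, j)[1]
        cuts.append(k)
        i, j = k, j - 1
    bounds = [0] + cuts[::-1] + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```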
A. Rex
@A. Rex Thanks, that helps a lot.
FM
+1  A: 

I agree that dynamic programming may be the best approach - one technique that I would rule out is non-linear optimization. You have a non-linear objective function whether you're minimizing the square root or just the sum of the squared differences. You also have integer variables as part of your constraint set - assigning members to sets requires some integer variables regardless of your formulation. A non-linear optimization problem with integer variables is generally very difficult, if not impossible, to solve optimally. If you only need an approximate solution, a genetic algorithm might be a good approach, where the genetic string is a representation of the assignment of members to sets.
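For illustration, a toy genetic algorithm along those lines might look like this in Python (the names and all parameter choices here are arbitrary; a chromosome is a sorted tuple of cut positions, and fitness is the population standard deviation of the chunk sums, lower being better):

```python
import random
from statistics import pstdev

def ga_split(seq, n, pop_size=30, generations=200, seed=0):
    """Toy genetic-algorithm sketch: elitist selection plus mutation over
    chromosomes encoding the n-1 cut positions. Approximate only."""
    rng = random.Random(seed)
    L = len(seq)

    def fitness(cuts):
        bounds = (0,) + cuts + (L,)
        return pstdev(sum(seq[a:b]) for a, b in zip(bounds, bounds[1:]))

    def random_cuts():
        return tuple(sorted(rng.sample(range(1, L), n - 1)))

    def mutate(cuts):
        cuts = list(cuts)
        cuts[rng.randrange(len(cuts))] = rng.randrange(1, L)
        cuts = sorted(set(cuts))
        while len(cuts) < n - 1:  # repair collided (duplicate) cuts
            cuts.append(rng.choice([p for p in range(1, L) if p not in cuts]))
            cuts.sort()
        return tuple(cuts)

    pop = [random_cuts() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]  # elitism: the best always survive
        pop = survivors + [mutate(rng.choice(survivors)) for _ in survivors]
    best = min(pop, key=fitness)
    bounds = (0,) + best + (L,)
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```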

As far as doing all this in less than a second... good luck!

Grembo