I'm looking for an algorithm to segment a sequence of positive numbers into n subsequences, such that the standard deviation of the sums of the subsequences is minimized.

The ordering of the numbers in each subsequence must be the same as the ordering in the original sequence.

For example:

Suppose I have the sequence {1,1,1,1,1,1,10,1} and I want to segment it into 2 subsequences.
I believe the optimal solution would be {1,1,1,1,1,1}, {10,1}.

The sum of the 1st subsequence is 6, and the sum of the 2nd subsequence is 11.
The standard deviation of those two numbers is ~3.5, which I believe is the lowest possible.

Suppose I have the sequence {4,1,1,1,1,6} and I want to segment it into 3 subsequences.
I believe the optimal solution would be {4}, {1,1,1,1}, {6}.
The sums of the subsequences are 4, 4, and 6.
The standard deviation of those 3 numbers is ~1.15, which I believe is the lowest possible.

The best algorithm I was able to come up with is to compute the cumulative sums of the numbers in the sequence, and segment the sequence each time the cumulative sum crosses a multiple of [totalSum/numSubsequences].

For example, given the sequence {4,1,1,1,1,6}, the cumulative sums are {4,5,6,7,8,14}. The total of all the numbers in the sequence is 14, so, given that I want 3 subsequences, I should segment the sequence when the running total reaches 14/3 = 4.66 and 2 * 14/3 = 9.33.

However, there is no actual place in the sequence where the cumulative total equals 4.66 - the first cumulative total is 4, and the next is 5. So should I round up or round down? In this case, rounding down to 4 gives the optimal solution, but that isn't always the case. The best I can think of is to try every combination of rounding up and down, but that results in O(2^numSubsequences) complexity.
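For concreteness, here is one way that heuristic could be sketched in Python, always taking the round-down choice (the function names are illustrative, and `stdev_of_sums` uses the sample standard deviation to match the ~3.5 and ~1.15 figures above):

```python
import math

def greedy_split(seq, n):
    """Greedy cumulative-sum heuristic: cut whenever the running total
    would pass the next multiple of total/n, always rounding down so
    the boundary element starts the next subsequence."""
    total = sum(seq)
    target = total / n
    parts, current, running, boundary = [], [], 0, target
    for x in seq:
        if len(parts) < n - 1 and running + x > boundary:
            parts.append(current)
            current = []
            boundary += target
        current.append(x)
        running += x
    parts.append(current)
    return parts

def stdev_of_sums(parts):
    """Sample standard deviation of the subsequence sums."""
    sums = [sum(p) for p in parts]
    mean = sum(sums) / len(sums)
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / (len(sums) - 1))
```

On both examples above this happens to reproduce the optimal splits, but as noted, rounding down is not always the right choice.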

This seems to be the type of thing that would have a preexisting algorithm to apply, however my Googling has failed me. I am aware of the Partition Problem, which is NP-complete, but that deals with unordered sets, and not ordered sequences.

Any help would be appreciated.

+1  A: 

One idea that comes to my mind is to use the A* search algorithm.

More about that:

http://en.wikipedia.org/wiki/A*_search_algorithm

Good book to read about that:

Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig

Some things you could use for the A*:

  • Initial state: split the sequence into n subsequences of equal length (as nearly as possible).
  • Successor states: for each subsequence, extend it to the left or to the right (take the last number of subsequence i-1, if i != 0, or the first number of subsequence i+1, if i != n) to generate all child nodes of the current state.
  • Heuristic: the optimal value of each sum would be the mean of all values. It's admissible, so it can be used with A*.

I'm not sure this will really solve your problem, since I haven't solved this exact problem myself, but I think it might do pretty well. It may not be the most sophisticated solution for this specific problem, but it is surely better than any "try all combinations" approach. It is also sound and complete (because of the admissible heuristic).
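As a rough illustration, an A*-style search for this problem might be sketched as follows in Python. Note this is my own formulation, not necessarily the exact state space described above: it builds subsequences left to right, and the heuristic is a lower bound derived from the ideal case where all remaining sums equal the mean:

```python
import heapq

def astar_split(seq, n):
    """A*-style search sketch. A state is (i, j): the first i elements
    consumed, j subsequences completed. g = sum of squared subsequence
    sums so far; h = R^2 / (n - j), where R is the remaining sum -- a
    lower bound on the remaining cost (equal sums would be ideal), hence
    admissible. Since the mean of the sums is fixed at total/n,
    minimizing the sum of squares minimizes the standard deviation."""
    L = len(seq)
    prefix = [0] * (L + 1)
    for i, x in enumerate(seq):
        prefix[i + 1] = prefix[i] + x
    total = prefix[L]

    def h(i, j):
        rem = total - prefix[i]
        return rem * rem / (n - j) if j < n else 0.0

    # heap entries: (g + h, g, i, j, end indices of completed subsequences)
    heap = [(h(0, 0), 0.0, 0, 0, ())]
    seen = {}
    while heap:
        f, g, i, j, ends = heapq.heappop(heap)
        if j == n:
            if i == L:  # goal: everything consumed, n subsequences formed
                bounds = (0,) + ends
                return [list(seq[a:b]) for a, b in zip(bounds, bounds[1:])]
            continue  # dead end: used all n subsequences too early
        if seen.get((i, j), float("inf")) <= g:
            continue
        seen[(i, j)] = g
        # next subsequence is seq[i:k]; leave room for the remaining ones
        for k in range(i + 1, L - (n - j - 1) + 1):
            s = prefix[k] - prefix[i]
            g2 = g + s * s
            heapq.heappush(heap, (g2 + h(k, j + 1), g2, k, j + 1, ends + (k,)))
    return None
```

Because the heuristic is admissible (and consistent), the first goal state popped is an optimal split.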

If you have more questions about this, ask and I will try my best to help you.

George B.
+1  A: 

I think you mean divide into contiguous chunks, or in other words find n-1 places at which to cut the sequence into pieces. (If you really mean to allow subsequences that interleave to create the main sequence, you could probably just sort the sequence, solve the chunk problem, and then track where the individual numbers came from to provide interleaved subsequences).

I think you can solve this in time proportional to n times the sequence length using dynamic programming. Work from left to right, filling in arrays bestCost[i][j] and lastCut[i][j], where i runs along the sequence and j runs from 0 to n-1. bestCost[i][j] is the cost of the best way of cutting the sequence from 0 to i into j chunks, and lastCut[i][j] is the position of the most recent cut in the solution that produces bestCost[i][j]. The recurrence is bestCost[i + 1][j] = min_k (std deviation(i + 1 to k) + bestCost[k - 1][j - 1]), with lastCut[i + 1][j] set to the minimizing k. At the end, work out the cost of the best answer for n cuts in the same way, and then use lastCut[][] to trace your way back to the other cuts.

mcdowella
Your solution is good, except for a slightly different problem. The OP wants the standard deviation of the sums of the subsequences, not the sum of the standard deviations of the subsequences. Also, this solution is `O(length^2 * #subsequences)` not `O(length * #subsequences)` because computing that minimum in `bestCost[i+1][j]` takes `O(length)` time. Indeed, you have to run over that many different values of `k`. (Incidentally, I started writing my answer before you posted yours. It's no coincidence they're similar because dynamic programming is the way to go here.)
A. Rex
+7  A: 

Suppose the length of the original sequence is L and the number of subsequences is N.

You may simplify the expression for the standard deviation to get sqrt(E[X^2] - E[X]^2), where E denotes expectation/average and X denotes your random variable -- in your case, the sum of a subsequence. (A similar formula applies to the "sample standard deviation".) Note that E[X] does not depend on how you split your sequence, because it is always the total sum divided by N. Thus, we just want to minimize E[X^2] or, equivalently, the sum of the squared subsequence sums (they differ by a factor of N, by the definition of average).
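To see concretely that the two objectives have the same minimizer, here is a small brute-force check in Python over every contiguous split of the first example (helper names are illustrative):

```python
from itertools import combinations
from statistics import pstdev

def all_splits(seq, n):
    """Every way to cut seq into n contiguous non-empty chunks."""
    L = len(seq)
    for cuts in combinations(range(1, L), n - 1):
        bounds = (0,) + cuts + (L,)
        yield [seq[a:b] for a, b in zip(bounds, bounds[1:])]

seq, n = [1, 1, 1, 1, 1, 1, 10, 1], 2
best_by_std = min(all_splits(seq, n), key=lambda p: pstdev(map(sum, p)))
best_by_ssq = min(all_splits(seq, n), key=lambda p: sum(sum(c) ** 2 for c in p))
assert best_by_std == best_by_ssq  # the minimizers coincide
```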

At this point, we can see that this problem can be solved with dynamic programming. Let f(i,j), for i from 0 to L and j from 1 to N, be the minimal sum of squares of the subsequence sums over splits of the first i elements of your sequence into j subsequences. Then f(i,j) may be computed in terms of the f(i',j') with i' <= i and j' < j. More specifically, if your sequence is a[k], indexed from 0 to L-1:

f(i,1) = sum( a[k] for 0 <= k < i )^2
f(i,j) = minimum of  f(l,j-1) + sum( a[k] for l <= k < i )^2  for l from 0 to i

Having computed f(L,N), you can use standard dynamic programming techniques to recover the splits. In particular, you can store the l that minimizes f(i,j).

The runtime of this solution is O(L^2 N) because you compute O(L N) different values of f and the minimum is over O(L) different values of l.

Here's a straightforward implementation in Perl:

#!/usr/bin/perl

use strict;
use warnings;

local $\ = $/;
print join ", ", map {"@$_"} best( 2, qw(1 1 1 1 1 1 10 1) );
# prints "1 1 1 1 1 1, 10 1"

print join ", ", map {"@$_"} best( 3, qw(4 1 1 1 1 6) );
# prints "4, 1 1 1 1, 6"

sub best {
    my( $N, @a ) = @_;

    my( @f, @g, $i, $j, $k, $sum );

    # DP base case
    $sum = 0;
    $f[0][1] = $g[0][1] = 0;
    for $i ( 1 .. @a ) {
        $sum += $a[$i-1];
        $f[$i][1] = $sum * $sum;
        $g[$i][1] = 0;
    }

    # DP recurrence
    for $j ( 2 .. $N ) {
        $f[0][$j] = $g[0][$j] = 0;
        for $i ( 1 .. @a ) {
            $sum = 0;
            $f[$i][$j] = $f[$i][$j-1];
            $g[$i][$j] = $i;
            for $k ( reverse 0 .. $i-1 ) {
                $sum += $a[$k];
                if( $f[$i][$j] > $f[$k][$j-1] + $sum * $sum ) {
                    $f[$i][$j] = $f[$k][$j-1] + $sum * $sum;
                    $g[$i][$j] = $k;
                }
            }
        }
    }

    # Extract best expansion
    my( @result );
    $i = @a; $j = $N;

    while( $j ) {
        $k = $g[$i][$j];
        unshift @result, [@a[$k .. $i-1]];
        $i = $k;
        $j--;
    }

    return @result;
}
A. Rex
Good answer! I was trying to turn this into a DP problem but you got it first :P
Charles Ma
+1 I'm struggling to learn about dynamic programming and am using your answer as a case study. Any additional clarifications you can add would be appreciated. In particular -- and I'm not certain this is a good question -- is there an intuitive explanation for the values in the `@f` matrix after the "DP recurrence" section of the code has finished? Thanks.
FM
$f[$i][$j] contains the answer to a subproblem: the smallest possible sum of squares for the first $i entries from your list, assuming you divide into $j parts. Then $g[$i][$j] contains the location of the start of the last subdivision. The key insight that makes dynamic programming work is that if you take an optimal solution to this problem and take away the last (or first) group of numbers, you have an optimal solution to a smaller problem. Thus, you can build up a solution by solving subproblems. It's also equivalent to recursion with memoization, so if you get that, you've almost got it.
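For what it's worth, here is a rough Python sketch of that recursion-with-memoization view (the names are my own, not taken from the Perl above): f(i, j) returns both the minimal sum of squares for the first i elements in j chunks and the cut point that achieves it.

```python
from functools import lru_cache

def best_split(seq, n):
    """Memoized-recursion version of the DP: f(i, j) = (minimal sum of
    squared chunk sums for the first i elements split into j chunks,
    start index of the last chunk in that optimum)."""
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    @lru_cache(maxsize=None)
    def f(i, j):
        if j == 1:
            return prefix[i] ** 2, 0
        # last chunk is seq[k:i]; k >= j-1 leaves room for j-1 earlier chunks
        return min(
            (f(k, j - 1)[0] + (prefix[i] - prefix[k]) ** 2, k)
            for k in range(j - 1, i)
        )

    # walk the stored cut points back to recover the chunks
    i, j, cuts = len(seq), n, []
    while j > 1:
        k = f(i, j)[1]
        cuts.append(k)
        i, j = k, j - 1
    bounds = [0] + cuts[::-1] + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```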
A. Rex
@A. Rex Thanks, that helps a lot.
FM
+1  A: 

I agree that dynamic programming may be the best approach - one technique that I would rule out is non-linear optimization. You have a non-linear objective function whether you're minimizing the square root or just the sum of the squared differences. You also have integer variables as part of your constraint set - assigning members to sets requires some integer variables regardless of your formulation. A non-linear optimization problem with integer variables is generally very difficult, if not impossible, to solve optimally. If you only need an approximate solution, a genetic algorithm might be a good approach, where the genetic string is a representation of the assignment of members to sets.
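For illustration, a toy genetic algorithm along those lines might look like this in Python (the names and all parameter choices here are arbitrary; a chromosome is a sorted tuple of cut positions, and fitness is the population standard deviation of the chunk sums, lower being better):

```python
import random
from statistics import pstdev

def ga_split(seq, n, pop_size=30, generations=200, seed=0):
    """Toy genetic-algorithm sketch: elitist selection plus mutation over
    chromosomes encoding the n-1 cut positions. Approximate only."""
    rng = random.Random(seed)
    L = len(seq)

    def fitness(cuts):
        bounds = (0,) + cuts + (L,)
        return pstdev(sum(seq[a:b]) for a, b in zip(bounds, bounds[1:]))

    def random_cuts():
        return tuple(sorted(rng.sample(range(1, L), n - 1)))

    def mutate(cuts):
        cuts = list(cuts)
        cuts[rng.randrange(len(cuts))] = rng.randrange(1, L)
        cuts = sorted(set(cuts))
        while len(cuts) < n - 1:  # repair collided (duplicate) cuts
            cuts.append(rng.choice([p for p in range(1, L) if p not in cuts]))
            cuts.sort()
        return tuple(cuts)

    pop = [random_cuts() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]  # elitism: the best always survive
        pop = survivors + [mutate(rng.choice(survivors)) for _ in survivors]
    best = min(pop, key=fitness)
    bounds = (0,) + best + (L,)
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```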

As far as doing all this in less than a second... good luck!

Grembo