Hello,

Given a set of n numbers (1 <= n <= 100), where each number is an integer between 1 and 450, we need to distribute this set of numbers into two sets A and B, such that the following two conditions hold:

  1. The number of elements in each set differs by at most 1.
  2. The sum of all the numbers in A is as nearly equal as possible to the sum of all the numbers in B, i.e. the distribution should be fair.

Can someone please suggest an efficient algorithm for solving the above problem?

Thank You.

+2  A: 

If the numbers are sequential then you just alternate assigning them between A and B.

I suspect they are not, in which case...

Assign the largest unassigned number to the group with the lower sum, unless the difference in size between the groups has reached the count of unassigned numbers (in which case assign all of the remaining numbers to the smaller group).

It won't find the best solution in all cases, but it's close and simple.

RichH
Unfortunately, alternation doesn't quite meet the specified goals. Imagine the sequence 1, 2, 3: assigning numbers alternately would give you A = { 1, 3 } and B = { 2 }, but a fairer distribution would be A = { 1, 2 } and B = { 3 }.
C Pirate
I did say it wouldn't find the best solution, but it is a really simple way to find a solution that is close.
RichH
+2  A: 

Never mind, I thought the numbers were sequential. This looks kind of like the Knapsack Problem, which is NP-hard.


If the numbers are sequential:

  1. Put the largest number in A
  2. Put the next largest number in B
  3. Put the next largest number in B
  4. Put the next largest number in A
  5. Repeat from step 1 until all the numbers are assigned.

Proof:

After every multiple of 4 numbers has been assigned, A and B both contain the same number of items, and the sum of the items in each group is the same, because

(n) + (n - 3) == (n - 1) + (n - 2)

In the last iteration we are at Step 1 above and we have either 0, 1 [1], 2 [1,2], or 3 [1,2,3] numbers remaining.

In case 0, we are done and the groups are equal in count and weight.

In case 1, we assign the number 1 to group A. Group A has one more item and one more weight. This is as fair as we can get in this situation.

In case 2, we assign the number 2 to group A and the number 1 to group B. Now the groups have the same number of items and group A has one extra weight. Again, this is as fair as we can get.

In case 3, assign the number 3 to group A, and assign numbers 2 and 1 to group B. Now the groups have the same weight (3 == 2 + 1) and group B has one extra item.

mobrule
Though the knapsack problem is definitely NP-hard, this problem (with its constraints on the input) can be solved in polynomial time.
Olexiy
This does not work in all cases; try with 10, 15, 20, 25, 30, 45, 75. The result is A = {75, 25, 20} and B = {45, 30, 15, 10}. Sum in A = 120, sum in B = 100, whereas it is possible to have 110 in each.
Patrice Bernassola
+1  A: 

Your requirement in #2 needs clarification, because "The sum of all the numbers in A is as nearly equal as possible to the sum of all the numbers in B" is clear, but then your statement "the distribution should be fair" makes everything unclear. What exactly does 'fair' mean? Does the process need a random element in it?

Jon Schoning
Thanks for your reply, Jon. By fair, I mean the same thing, i.e. the sum of the numbers in each set should be as nearly equal as possible, and the counts of the two sets should differ by at most 1.
A: 

I assume the numbers are not sequential, and you can't re-balance?

Because of constraint 1, you're going to need to switch buckets every other insertion, always. So every time you're not forced to pick a bucket, pick a logical bucket (where adding the number would make the sum closer to the other bucket). If this bucket isn't the same one as your previous bucket, you get another turn where you're not forced.

psychotik
A: 

Any dual knapsack algorithm will do (regardless of distribution of numbers).

Esteban Araya
An example or references would be great! :-)
fortran
+1  A: 

@ShreevatsaR notes that the algorithm below is known as the greedy algorithm. It does not do very well with certain inputs (I tried 10 different randomly generated input sets of size 100, and in all cases the sums were very close, which led me to think sorting the input was enough for this algorithm to succeed).

See also "The Easiest Hard Problem", American Scientist, March-April 2002, recommended by ShreevatsaR.

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw( sum );

my @numbers = generate_list();

print "@numbers\n\n";

my (@A, @B);
my $N = @numbers;

while ( @numbers ) {
    my $n = pop @numbers;
    printf "Step: %d\n", $N - @numbers;
    {
        no warnings 'uninitialized';

        if ( sum(@A) < sum(@B) ) {
            push @A, $n;
        }
        else {
            push @B, $n;
        }
        printf "A: %s\n\tsum: %d\n\tnum elements: %d\n",
            "@A", sum(@A), scalar @A;
        printf "B: %s\n\tsum: %d\n\tnum elements: %d\n\n",
            "@B", sum(@B), scalar @B;
    }
}

sub generate_list { grep { rand > 0.8 } 1 .. 450 }

Note that generate_list returns a list in ascending order.

Sinan Ünür
This algorithm doesn't guarantee the second constraint either - for example, for the list {1, 1, 3, 3, 4} it will produce something like {1, 3, 4}-{1, 3} instead of the correct answer {1, 1, 4}-{3, 3}
Olexiy
@Olexiy: You should try running the code before downvoting. You seem to have missed the fact that I am `pop`ping from the main list.
Sinan Ünür
I missed the fact that you pop from the end of the list - try {4, 3, 3, 1, 1} now
Olexiy
@Olexiy How is it not clear that I am assuming that the input is sorted?
Sinan Ünür
Use {2, 2, 2, 3, 3}. The optimal solution is to split into two sets {2, 2, 2} and {3, 3} both equal to 6, but your code (which I ran) produces {3, 2} and {3, 2, 2}, which are 5 and 7. Your idea is the differencing heuristic for the number-partitioning problem, but although it works for many cases, it is known that it can be very far from optimal for bad cases.
ShreevatsaR
Correction: Your idea is known as the greedy algorithm; "differencing" is a different idea due to Karmarkar and Karp (which does better, but is still not optimal). Further reading: Brian Hayes's article "The Easiest Hard Problem", *American Scientist*, March-April 2002. http://www.americanscientist.org/issues/pub/the-easiest-hard-problem/
ShreevatsaR
@ShreevatsaR Thanks for the constructive input and the link to a great article. Just out of curiosity, can you think of a counter-example where all numbers are unique?
Sinan Ünür
Yes, the same idea works: modify the example (20, 20, 20, 30, 30) slightly to get the new example (19, 20, 21, 29, 31). :-) (Again the optimal is {19, 20, 21} and {29, 31} with 60 each, but this algorithm gives ~50 and ~70... 51 and 69, I think.)
ShreevatsaR
@ShreevatsaR Got it. I should have seen it. (minor correction, 52 and 68).
Sinan Ünür
@ Sinan Ünür: The input is not sorted, your program does not sort it, and I can not read your thoughts - how can I guess that you sort it?
Olexiy
@Olexiy `sub generate_list { grep { rand > 0.8 } 1 .. 450 }` this will select each number from `1` to `450` with probability `0.2`. The list will be sorted because `1` thru `450` is sorted.
Sinan Ünür
Ok, sorry, didn't see that part
Olexiy
+3  A: 

This is a constrained version of the Number Partitioning Problem. Usually the goal is to find any 2 disjoint subsets that minimize the difference of the sums. Your problem is constrained in the sense that you only consider 1 possibility: 2 sets of size N/2 (or one set of size N/2 and one of size N/2+1 if the total count is odd). This dramatically reduces the search space, but I can't think of a good algorithm at the moment; I'll think about it.

Falaina
My calculation shows that there are 1.00891344545564e+29 ways to partition 100 numbers into 2 sets of 50 (the worst case), so any brute force approach is still going to fail! Still, as your link says, number partitioning is seen as the easiest of the NP-complete problems!
Jackson
The Number Partitioning Problem (or the knapsack problem, or the subset-sum problem) is NP-complete only when you define the input size as the "length" of the input, i.e., the number of bits needed to describe the numbers. But there are easy dynamic-programming algorithms that run in time polynomial in the *numbers* themselves (here 1 to 100); so the fact that it is NP-complete shouldn't be a deterrent. (In fact, these problems undergo a "phase transition" when the sizes of the numbers go from polynomial to exponential in n... but 100 is a very small number, so that's of no concern here.)
ShreevatsaR
In short: constrained versions of NP-hard problems aren't necessarily hard. Far from it. :-)
ShreevatsaR
+8  A: 

Since the numbers are small, it is not NP-complete.

To solve it you can use dynamic programming:

Make a three-dimensional table of booleans where true at t[s, n, i] means that the sum s can be reached with a subset of n elements below index i. To compute the value for t[s, n, i] check t[s, n, i-1] and t[s - a[i], n-1, i-1]. Then look through the table at second index n/2 to find the best solution.

Edit: You actually don't need the complete table at once. You can make a two dimensional table t_i[s, n] for each index i and compute the table for i from the table for i-1, so you only need two of these two-dimensional tables, which saves a lot of memory. (Thanks to Martin Hock.)

starblue
A problem is NP-complete regardless of the size of the input in this case! :-p Another question is whether this instance of the problem is feasible given its size.
fortran
The constraint on the input values (each of them <= 450) makes the problem solvable in polynomial time. The problem can be stupidly brute-forced in O(N^450) (which is clearly polynomial). A quick N^3 solution is described in another comment.
Olexiy
@fortran: NO! NP-completeness (or even notions like "polynomial time") are defined in terms of the input size. The DP algorithm described here is perfectly correct; its running time is a polynomial function of the numbers in the input (here 1 to 100). The only reason this is not a polynomial-time algorithm for the general Number Partitioning Problem is that *there*, the "input size" is the *number of bits* to describe the input, so the log of the numbers in the input.
ShreevatsaR
+1 Ah, nice solution. I knew there had to be a decent pseudopolynomial solution with DP, but I couldn't think of the table. Thanks for the insight.
Falaina
You can also do it with a 2D table. t[s, k] true means that the sum s can be reached with some subset of k elements. Your array should be of size [S, ceil(n/2)] where S is the sum of all elements and n is the number of elements. Then iterate through the elements i[1] ... i[n]. For each element i[x], scan through the table and if t[s, n] is true, set t[s+i[x], n+1] true (unless this would fall off the table). Also, set t[i[x], 1] true. Finally, scan through the last column (all x for [x, ceil(n/2)]) for the optimal value.
Martin Hock
+2  A: 

First, find a solution to the problem without the first constraint (i.e. making the sums as close as possible). This problem can be solved using a DP approach (you can read more about DP here, and the first problem - about coins - is very similar to yours).

Once you can solve it, you can add one more state to your DP - the number of elements already selected into the subset. This gives you an N^3 algorithm.

Olexiy
This is correct, but a minor point: the algorithm is O(N^2*K) where K is the maximum possible sum of a subset (here 100*N), not necessarily N^3 in general.
ShreevatsaR
450*N. Yes, sure, but without that constraint the problem is NP-complete.
Olexiy
Yeah, I know. Although 450 is a "constant", it is really more informative to treat it as a parameter and say that the algorithm is O(N^3*K), rather than O(N^3). :-)
ShreevatsaR
A: 

I would give genetic algorithms a try, as this seems a very nice problem to apply them to.

The encoding is just a binary string of length N, where 0 means the element is in the first group and 1 means it is in the second. Penalize the fitness when the number of elements in each group differs, and penalize the difference between the two sums... Something like:

fitness(gen) = (sum(gen) - n/2)^2 + (sum(values[i]*(-1)**gen[i] for i in 0..n-1))^2

(And minimize the fitness)

Of course this can give you a sub-optimal answer, but for large real world problems it's usually enough.

fortran
+2  A: 

I have an algorithm for you. It uses a mix of recursive and iterative ideas.

Assume you have n numbers X1, ..., Xn with 1 <= n <= 100 and 1 <= Xi <= 450.

  1. If n < 3, distribute the numbers and stop,

  2. If n > 2, sort the list of numbers in ascending order,

  3. Compute the total sum S of all numbers,
  4. Divide the total S by (n - n%2)/2 to obtain the value A,
  5. Now we will create pairs of numbers whose sum is as near as possible to A. Take the first number and find a second number so that their sum S1 is as near as possible to A. Put S1 in a new list of numbers, and remember how each sum was computed so the base numbers can be recovered later.
  6. Repeat step 5 until fewer than 2 numbers remain in the list. Then put any remaining number into the sum list and restart the algorithm from step 1 with the new list.

Example:

Assuming: n = 7 and numbers are 10, 75, 30, 45, 25, 15, 20

Pass 1:

Since n > 2, sort the list: 10, 15, 20, 25, 30, 45, 75

Sum S = 220

A = 220 / ((7-1)/2) = 73

Couples:

10 & 75 => 85

15 & 45 => 60

20 & 30 => 50

Fewer than 2 numbers remain, so add 25 to the sum list: 85(10,75), 60(15,45), 50(20,30), 25(25)

Pass 2:

n = 4 and numbers are 85, 60, 50, 25

The list count is > 2, so sort the list: 25(25), 50(20,30), 60(15,45), 85(10,75)

Sum S is still the same (S = 220) but A must be recomputed: A = 220 / ((4-0)/2) = 110

Couples:

25 & 85 => 110

50 & 60 => 110

The Sum list is : 110(25(25),85(10,75)), 110(50(20,30),60(15,45))

Pass 3:

n = 2 and numbers are 110, 110

n < 3 so distribute numbers:

A = 25, 10, 75

B = 20, 30, 15, 45

This works in every scenario I have tested.

Patrice Bernassola
This is a good attempt, but it cannot possibly work for all inputs -- because if this were correct, it would be possible to extend it and get a polynomial-time algorithm for the NP-complete number partitioning problem. This is bound to fail, but I cannot think of a counterexample right now. Do you have some code for this? Try with larger numbers (e.g. with a set of 10 numbers close to 450 that has exactly one good partition).
ShreevatsaR
Actually, the same counterexample that I gave for another solution works: take (19, 20, 21, 29, 31). The sum is 120, so your algorithm will try to find two pairs close to 120/((5-1)/2)=60 each: (19,31) and (20,29). It will then proceed with the list 21(21), 49(20,29), 50(19,31). Next it tries to find pairs with sum 120/((3-1)/2)=120: (21,50(19,31)). So it ends with partitions 71(21,19,31) and 49(20,29), but the optimal partition is 60(29,31) and 60(19,20,21). Your algorithm actually works for this example if the "pairing" is done from the largest number first, but that cannot be generalized :)
ShreevatsaR
A: 

Simulated annealing can quite quickly find better and better answers. You could keep constraint 1 satisfied while improving how nearly constraint 2 is met.

Stephen Denne
A: 

If you need the perfect answer then you have to generate and loop through all of the possible sets of answers. If a pretty good answer is all you need then a technique like simulated annealing is the way to go. Here's some C code that uses a very primitive cooling schedule to find an answer.

#include <stdio.h>
#include <stdlib.h>

#define MAXPAR 50
#define MAXTRIES 10000000

int data1[] = {192,130,446,328,40,174,218,31,59,234,26,365,253,11,198,98,
               279,6,276,72,219,15,192,289,289,191,244,62,443,431,363,10
              } ;
int data2[] = { 1,2,3,4,5,6,7,8,9 } ;

// What does the set sum to
int sumSet ( int data[], int len )
{
    int result = 0 ;
    for ( int i=0; i < len; ++i )
     result += data[i] ;
    return result ;
}

// Print out a set
void printSet ( int data[], int len )
{
    for ( int i=0; i < len; ++i )
     printf ( "%d ", data[i] ) ;
    printf ( " Sums to %d\n", sumSet ( data,len ) ) ;
}

// Partition the values using simulated annealing
void partition ( int data[], size_t len )
{
    int set1[MAXPAR] = {0} ; // Partition 1
    int set2[MAXPAR] = {0} ; // Partition 2
    int set1Pos, set2Pos, dataPos, set1Len, set2Len ;  // Data about the partitions
    int minDiff ; // The best solution found so far
    int sum1, sum2, diff ;
    int tries = MAXTRIES ; // Don't loop for ever

    set1Len = set2Len = -1 ;
    dataPos = 0 ;

    // Initialize the two partitions
    while ( dataPos < len )
    {
     set1[++set1Len] = data[dataPos++] ;
     if ( dataPos < len )
      set2[++set2Len] = data[dataPos++] ;
    }


    // Very primitive simulated annealing solution
    sum1 = sumSet ( set1, set1Len ) ;
    sum2 = sumSet ( set2, set2Len ) ;
    diff = sum1 - sum2 ; // The initial difference - we want to minimize this
    minDiff = sum1 + sum2 ;
    printf ( "Initial diff is %d\n", diff ) ;

    // Loop until a solution is found or all are tries are exhausted
    while ( diff != 0 && tries > 0 )
    {
     // Look for swaps that improves the difference
     int newDiff, newSum1, newSum2 ;
     set1Pos = rand() % set1Len ;
     set2Pos = rand() % set2Len ;

     newSum1 = sum1 - set1[set1Pos] + set2[set2Pos] ;
     newSum2 = sum2 + set1[set1Pos] - set2[set2Pos] ;
     newDiff = newSum1 - newSum2 ;
     if ( abs ( newDiff ) < abs ( diff ) ||  // Is this a better solution?
             tries/100 > rand() % MAXTRIES )  // Or shall we just swap anyway - chance of swap decreases as tries reduces
     {
      int tmp = set1[set1Pos] ;
      set1[set1Pos] = set2[set2Pos] ;
      set2[set2Pos] = tmp ;
      diff = newDiff ;
      sum1 = newSum1 ;
      sum2 = newSum2 ;

      // Print it out if its the best we have seen so far
      if ( abs ( diff ) < abs ( minDiff ) )
      {
       minDiff = diff ;
       printSet ( set1, set1Len ) ;
       printSet ( set2, set2Len ) ;
       printf ( "diff of %d\n\n", abs ( diff ) ) ;
      }
     }


     --tries ;
    }

    printf ( "done\n" ) ;
}


int main ( int argc, char **argv )
{
    // Change this to seed rand from the clock if you don't want the same
    // results reproduced every time!
    srand ( 12345 ) ;
    partition ( data1, sizeof(data1)/sizeof(data1[0]) ) ; // data1 has 32 elements, not 31
    partition ( data2, 9 ) ;
    return 0;
}
Jackson
Hahaha, solving an NP problem with Perl! I've seen everything now! ;-)
Eric M