views:

308

answers:

2
+2  Q: 

Subset sum problem

I'm having a counting problem which is a continuation of this question. I'm not really a math person, so it's hard for me to figure out the subset sum problem that was suggested as a resolution.

I have 4 ArrayLists in which I hold data: alID, alTransaction, alNumber, alPrice

Type | Transaction | Number | Price
8 | Buy | 95.00000000 | 305.00000000
8 | Buy | 126.00000000 | 305.00000000
8 | Buy | 93.00000000 | 306.00000000
8 | Transfer out | 221.00000000 | 305.00000000
8 | Transfer in | 221.00000000 | 305.00000000
8 | Sell | 93.00000000 | 360.00000000
8 | Sell | 95.00000000 | 360.00000000
8 | Sell | 126.00000000 | 360.00000000
8 | Buy | 276.00000000 | 380.00000000

In the end I'm trying to get what's left for the customer, and what's left I put into 3 ArrayLists:
- alNewHowMuch (corresponds to alNumber),
- alNewPrice (corresponds to alPrice),
- alNewInID (corresponds to alID)

        ArrayList alNewHowMuch = new ArrayList();
        ArrayList alNewPrice = new ArrayList();
        ArrayList alNewInID = new ArrayList();
        for (int i = 0; i < alTransaction.Count; i++) {
            string transaction = (string) alTransaction[i];
            string id = (string) alID[i];
            decimal price = (decimal) alPrice[i];
            decimal number = (decimal) alNumber[i];
            switch (transaction) {
                case "Transfer out":
                case "Sell":
                    int index = alNewHowMuch.IndexOf(number);
                    if (index != -1) {
                        alNewHowMuch.RemoveAt(index);
                        alNewPrice.RemoveAt(index);
                        alNewInID.RemoveAt(index);
                    } else {
                        ArrayList alTemp = new ArrayList();
                        decimal sum = 0;
                        for (int j = 0; j < alNewHowMuch.Count; j ++) {
                            string tempid = (string) alNewInID[j];
                            decimal tempPrice = (decimal) alNewPrice[j];
                            decimal tempNumbers = (decimal) alNewHowMuch[j];
                            if (id == tempid && tempPrice == price) {
                                alTemp.Add(j);
                                sum = sum + tempNumbers;
                            }
                        }
                        if (sum == number) {
                            for (int j = alTemp.Count - 1; j >= 0; j --) {
                                int tempIndex = (int) alTemp[j];
                                alNewHowMuch.RemoveAt(tempIndex);
                                alNewPrice.RemoveAt(tempIndex);
                                alNewInID.RemoveAt(tempIndex);
                            }
                        }
                    }
                    break;
                case "Transfer In":
                case "Buy":
                    alNewHowMuch.Add(number);
                    alNewPrice.Add(price);
                    alNewInID.Add(id);
                    break;
            }
        }

Basically I'm adding and removing things from the ArrayLists depending on Transaction Type, Transaction ID and Number. I'm adding numbers to the ArrayList like 156, 340 (when it is Transfer In or Buy) etc., and then I remove them the same way, 156, 340 (when it's Transfer Out or Sell). My solution works for that without a problem. The problem I have is that for some old data employees were entering sums like 1500 instead of 500+400+100+500. How would I change it so that when there's a Sell/Transfer Out or Buy/Transfer In and there's no exact match inside the ArrayList, it tries to combine multiple items from that ArrayList and find the elements that sum to the aggregate?

Inside my code I tried to resolve that problem by simply summing everything with the same ID and price when there's no exact match (index == -1):

                    int index = alNewHowMuch.IndexOf(number);
                    if (index != -1) {
                        alNewHowMuch.RemoveAt(index);
                        alNewPrice.RemoveAt(index);
                        alNewInID.RemoveAt(index);
                    } else {
                        ArrayList alTemp = new ArrayList();
                        decimal sum = 0;
                        for (int j = 0; j < alNewHowMuch.Count; j ++) {
                            string tempid = (string) alNewInID[j];
                            decimal tempPrice = (decimal) alNewPrice[j];
                            decimal tempNumbers = (decimal) alNewHowMuch[j];
                            if (id == tempid && tempPrice == price) {
                                alTemp.Add(j);
                                sum = sum + tempNumbers;
                            }
                        }
                        if (sum == number) {
                            for (int j = alTemp.Count - 1; j >= 0; j --) {
                                int tempIndex = (int) alTemp[j];
                                alNewHowMuch.RemoveAt(tempIndex);
                                alNewPrice.RemoveAt(tempIndex);
                                alNewInID.RemoveAt(tempIndex);
                            }
                        }
                    }

But it only works if certain conditions are met, and fails for the rest.

Edit: Since some of you were so astonished (and blinded) by my Polish variable names, I translated all of them to English for simplicity and visibility. Hopefully this will help me get some help :-)

+2  A: 

How you should do this depends on a number of important things: how many numbers will you have, and how big will they be? Also, as far as I understand, your data can change (add / remove numbers etc.), right? How often do you need to make these queries?

I'll present two solutions. I suggest you use the second, as I suspect it's better for what you need and it's a lot easier to understand.

Solution 1 - dynamic programming

Let S[i] = true if we can make sum i and false otherwise.

S[0] = true // we can always make sum 0: just don't choose any number
S[i] = false for all i != 0
for each number i in your input
    for s = MaxSum downto i
        if ( S[s - i] == true )
            S[s] = true; // if we can make the sum s - i, we can also make the sum s by adding i to the sum s - i.

To get the actual numbers that make up your sum you should keep another vector P[i] = the last number that was used to make sum i. You would update this accordingly in the if condition above.
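
As a small example: with the numbers {3, 5} and MaxSum = 8, processing 3 sets S[3] = true (P[3] = 3); processing 5 then sets S[8] = true (P[8] = 5, because S[3] is already true) and S[5] = true (P[5] = 5). To reconstruct the sum 8, read P[8] = 5, subtract it to get 3, read P[3] = 3, subtract again to reach 0, and stop: the subset is {5, 3}.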

The time complexity of this is O(numberOfNumbers * maxSumOfAllNumbers), which is pretty bad, especially since you have to rerun this algorithm whenever your data changes. It's also slow even for a single run if your numbers can be very big or you can have a lot of them. In fact, "a lot" is misleading: if you have just 100 numbers and each number can be as big as 10 000, you will do roughly 100 * 10 000 = 1 000 000 operations each time your data changes.

It's a good solution to know, but not really useful in practice, or at least not in your case I think.

Here's some C# for the approach I suggest:

    class Program
    {
        static void Main(string[] args)
        {
            List<int> testList = new List<int>();

            for (int i = 0; i < 1000; ++i)
            {
                testList.Add(1);
            }

            Console.WriteLine(SubsetSum.Find(testList, 1000));

            foreach (int index in SubsetSum.GetLastResult(1000))
            {
                Console.WriteLine(index);
            }
        }
    }

    static class SubsetSum
    {
        private static Dictionary<int, bool> memo;
        private static Dictionary<int, KeyValuePair<int, int>> prev;

        static SubsetSum()
        {
            memo = new Dictionary<int, bool>();
            prev = new Dictionary<int, KeyValuePair<int, int>>();
        }

        public static bool Find(List<int> inputArray, int sum)
        {
            memo.Clear();
            prev.Clear();

            memo[0] = true;
            prev[0] = new KeyValuePair<int,int>(-1, 0);

            for (int i = 0; i < inputArray.Count; ++i)
            {
                int num = inputArray[i];
                for (int s = sum; s >= num; --s)
                {
                    if (memo.ContainsKey(s - num) && memo[s - num] == true)
                    {
                        memo[s] = true;

                        if (!prev.ContainsKey(s))
                        {
                            prev[s] = new KeyValuePair<int,int>(i, num);
                        }
                    }
                }
            }

            return memo.ContainsKey(sum) && memo[sum];
        }

        public static IEnumerable<int> GetLastResult(int sum)
        {
            while (prev[sum].Key != -1)
            {
                yield return prev[sum].Key;
                sum -= prev[sum].Value;
            }
        }
    }

You should perhaps do some error checking, and maybe store the last sum in the class so that GetLastResult can't be called with a different sum than the one Find was last called with. Anyway, this is the idea.
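
For example, a minimal sketch of that guard (the lastSum and lastFound fields are my naming, not part of the class above; Find would set them just before it returns):

    private static int lastSum;       // the sum Find was last called with
    private static bool lastFound;    // whether Find succeeded for that sum

    public static IEnumerable<int> GetLastResult(int sum)
    {
        // Iterator methods run lazily, so this check happens on first enumeration.
        if (sum != lastSum || !lastFound)
            yield break; // nothing to reconstruct for this sum

        while (prev[sum].Key != -1)
        {
            yield return prev[sum].Key;
            sum -= prev[sum].Value;
        }
    }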

Solution 2 - randomized algorithm

Now, this is easier. Keep two lists: usedNums and unusedNums. Also keep a variable usedSum that, at any point in time, contains the sum of all the numbers in the usedNums list.

Whenever you need to insert a number into your set, also add it to one of the two lists (doesn't matter which, but do it randomly so there's a relatively even distribution). Update usedSum accordingly.

Whenever you need to remove a number from your set, find out which of the two lists it's in. You can do this with a linear search as long as you don't have a lot of numbers (this time "a lot" means over 10 000, maybe even 100 000 on a fast computer, assuming you don't do this operation often and in fast succession; the linear search can be optimized if you need it to be). Once you have found the number, remove it from the list. Update usedSum accordingly.

Whenever you need to find if there are numbers in your set that sum to a number S, use this algorithm:

while S != usedSum
    if S > usedSum // our current usedSum is too small
        move a random number from unusedNums to usedNums and update usedSum
    else // our current usedSum is too big
        move a random number from usedNums to unusedNums and update usedSum

At the end of the algorithm, the list usedNums will give you the numbers whose sum is S.

This algorithm should be good for what you need, I think. It handles changes to the dataset very well and works well for a high number count. It also doesn't depend on how big the numbers are, which is very useful if you have big numbers.
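
To make the bookkeeping concrete, here is a rough C# sketch of this randomized approach (names like RandomSubsetSum, FindSubset and MoveRandom are mine, just for illustration). The FindSubset loop is the literal pseudocode above, so it gives no termination guarantee when no subset sums exactly to the target:

    using System;
    using System.Collections.Generic;

    class RandomSubsetSum
    {
        private readonly List<int> usedNums = new List<int>();
        private readonly List<int> unusedNums = new List<int>();
        private int usedSum = 0;
        private readonly Random rng = new Random();

        public void Add(int n)
        {
            // Put the new number into one of the two lists at random.
            if (rng.Next(2) == 0) { usedNums.Add(n); usedSum += n; }
            else unusedNums.Add(n);
        }

        public bool Remove(int n)
        {
            // Linear search, as described above; update usedSum only if the
            // number was in the used list.
            int i = usedNums.IndexOf(n);
            if (i != -1) { usedNums.RemoveAt(i); usedSum -= n; return true; }
            i = unusedNums.IndexOf(n);
            if (i != -1) { unusedNums.RemoveAt(i); return true; }
            return false;
        }

        // Shuffles numbers between the two lists until usedNums sums to target.
        public List<int> FindSubset(int target)
        {
            while (usedSum != target)
            {
                if (usedSum < target && unusedNums.Count > 0)
                    MoveRandom(unusedNums, usedNums, +1);
                else if (usedSum > target && usedNums.Count > 0)
                    MoveRandom(usedNums, unusedNums, -1);
                else
                    return null; // all numbers are used and the total is still below target
            }
            return new List<int>(usedNums);
        }

        private void MoveRandom(List<int> from, List<int> to, int sign)
        {
            int i = rng.Next(from.Count);
            int n = from[i];
            from.RemoveAt(i);
            to.Add(n);
            usedSum += sign * n;
        }
    }

When FindSubset succeeds, usedNums is exactly the list of numbers whose sum is S, as described above.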

Please post if you have any questions.

IVlad
So I was thinking: if you multiply all numbers in the problem by 10, then it is still the same problem, so there is no reason why the algorithm should take longer to solve it. However, your DP algorithm does take a lot longer. I think you can fix this by using a hash table instead of an array, and filling it the other way around.
Jules
Here is a python solution for that idea: http://pastebin.com/mBirSUeZ If you call `subsetsum([1,2,3,4],6)` it's the same speed as `subsetsum([100,200,300,400],600)`.
Jules
Added a short explanation: http://pastebin.com/pyHEXVVK
Jules
The problem has no polynomial time solution, the DP solution is pseudopolynomial, so it does depend on the value of the numbers. Of course those two are the same speed, they're both very small and similar inputs. Also, your algorithm is a bit different than mine. You only attempt to give the answer for the requested sum, while I attempt to build a table with the answers for all the sums. Of course, yours is better for this problem instance, but has the same complexity. Consider a call `sub([1, 1, 1, 1, ...], <num ones>)`. This should make it painfully slow for a lot of ones.
IVlad
It should do about a million operations for as few as 1000 ones. Also, I can't run your Python code to play with it. I get an error at the `@memoize` line, is that supposed to be removed or implemented manually somehow? I'm not a Python guy.
IVlad
Oh, sorry, I forgot to include the definition of memoize because it's in my standard tools. Here it is: http://pastebin.com/Ur8B7h0n Actually, for the all-ones input this one is polynomial time :)
Jules
For example for 200 ones it's almost instantaneous.
Jules
I will try to fight it tonight or tomorrow, and will see how it goes. When I have something I'll update the main post. Thanks IVlad.
MadBoy
The @memoize annotation adds dynamic programming to the naive recursive version. If you remove it it will take very, very long. My program runs exactly as fast on [1,1,1,1] as on [1000,1000,1000,1000], however the bottom up table building dynamic programming algorithm will take much longer on the latter.
Jules
@Jules: true, my C# implementation of your method (check my updated post) also finishes almost immediately for 200. Try 1000 though, that one already takes a few clear seconds, and 1000 isn't even that big of a number. Try 10 000 and enjoy a stack overflow error :). It's an elegant solution, and it might be good for what the OP needs if he only has ~100 numbers, but generally I think the randomized method is better. Randomized also makes it easier to print the solution.
IVlad
My needs are between 2 and 20 numbers being added up. Hopefully this will work :-) Thanks guys, really appreciate it! Will report back with progress.
MadBoy
Perhaps it's a good idea to switch the cases around: first try to choose the current number and only after that try not to choose it. This helps because your check `if(sum<0) return false` is then able to prune this path quickly. Of course then there will be other cases that will be slow, but perhaps this helps on average?
Jules
@MadBoy: it doesn't matter how many numbers being added up you have, but how many total numbers you have. If those are also within a similar limit, then the recursive function is good enough. @Jules: maybe, it's worth testing.
IVlad
IVlad it seems that Tuple is part of .NET 4.0. Is it possible to do it without Tuple?
MadBoy
I have a .NET 3.5 project and wanted to wait a bit before upgrading to .NET 4.0.
MadBoy
You can try using a KeyValuePair or implementing your own Pair class. For example: public class Pair<T, U> { public Pair() { } public Pair(T first, U second) { this.First = first; this.Second = second; } public T First { get; set; } public U Second { get; set; }};
IVlad
Sorry, you must override `GetHashCode` and `Equals` in the class as well if you want to make your own. You can get away with replacing each `Tuple<>` with `KeyValuePair<>` though.
IVlad
The only problem I have is that it returns true or false, and not the IDs of the numbers that were used in a List (or Array) ;-) Also, using Tuple is a lot faster than KeyValuePair<>.
MadBoy
Yes, unfortunately Tuple is a lot faster. Even implementing my own class I couldn't get it to be as fast. Maybe it's fast enough for the amount of numbers you can have? I'm guessing it has something to do with the hash function, try messing around with the value returned by the overridden `GetHashCode` in your custom class. As for returning a list, I will post an implementation of that in a few hours as I have to get to class now. I hope it's not urgent. If you want to try it yourself, the idea is to backtrack based on the values in the dictionary and your numbers array.
IVlad
I waited for a long time, it can wait another couple of hours ;) Thanks for your help. I appreciate it a lot!
MadBoy
Ok, I changed the implementation a bit. I gave up on the recursive solution because it was slow even with tuples and it made it harder to get the indexes you need. Now it's iterative, finishes instantly for even 1000 numbers, is not prone to stack overflow errors and allows you to easily get more than just a true/false answer. It should compile in .net 3.5 with no problems.
IVlad
Thanks, you're the best. I'll take a look and report. Probably tomorrow, got some stuff to fix first :(
MadBoy
I've been testing it a bit, and if the result is false an exception is thrown as below. Other than that it seems very fine! Unhandled Exception: System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary. at System.Collections.Generic.Dictionary`2.get_Item(TKey key) at ConsoleApplication3.SubsetSum.<GetLastResult>d__0.MoveNext() in C:\Projes\Project.C.Sharp\ConsoleApplication3\ConsoleApplication3\Program.cs:line 53 at ConsoleApplication3.Program.Main(String[] args) in C:\Projects\Project.Charp\ConsoleApplication3\ConsoleApplication3\Program.cs:line 22
MadBoy
Also, what would happen if I have 2 loops in a row asking for the same number like 300 (since the ArrayList will hold different elements)? Would it show correct values?
MadBoy
If it's false you're not supposed to call `GetLastResult`, so that's to be expected. You can either have `GetLastResult` return a special value if the last result was false or make sure not to call it. To have it return a special value, use the `ContainsKey` method like I did in another method. You can definitely improve the class, I just focused on the algorithm rather than the design.
IVlad
The state of the class resets itself before every call, so everything starts from scratch. It should return correct answers, yes.
IVlad
Thanks IVlad. I know what to do now ;-) Let's see if it works with my change.
MadBoy
You can get rid of the keyvaluepairs by noticing that the memo is only sparse in the second value in the keyvaluepairs (the sum). So instead of using a Dictionary of (index,sum) -> bool you can use an array of HashSet. Then lookup memo[index] returns a HashSet containing all sums that can be made with the numbers up to index. I suspect that this will be a lot faster.
Jules
Actually perhaps it's better to use an array instead of HashSet. I will code that up later.
Jules
So I was thinking that maybe we can use a divide and conquer algorithm but that didn't turn out to be useful. However the last step in a divide and conquer algorithm, determining whether `target in {a+b | a in A, b in B}` can be computed in O(|A|+|B|) instead of O(|A|*|B|). The resulting algorithm is O(2^(n/2)) =~ O(1.4^n). The previous algorithms were O(2^n), so this is a huge improvement.
Jules
Preliminary result: the 1000 ones problem with target 1000 is solved in 20 milliseconds.
Jules
@Jules: it's solved just as fast with the solution I posted. I think we might as well avoid recursion in this case, since the iterative solution is pretty clean. Using a HashSet in the iterative implementation, or even a plain array, might indeed be faster.
IVlad
Yes I did mean HashSet in the iterative version. See my post below for an improved algorithm. It's a bit more complicated but much faster.
Jules
+2  A: 

Here is my algorithm. It runs in O(2^(n/2)) and solves SubsetSum(1000, list-of-1000-ones) in 20 milliseconds. See the comments at the end of IVlad's post.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace SubsetSum
{
    class Program
    {
        static void Main(string[] args)
        {
            var ns = new List<int>();
            for (int i = 0; i < 1000; i++) ns.Add(1);
            var s1 = Stopwatch.StartNew();
            bool result = SubsetSum(ns, 1000);
            s1.Stop();
            Console.WriteLine(result);
            Console.WriteLine(s1.Elapsed);
            Console.Read();
        }

        static bool SubsetSum(List<int> nums, int target)
        {
            var left = new List<int> { 0 };
            var right = new List<int> { 0 };
            foreach (var n in nums)
            {
                if (left.Count < right.Count) left = Insert(n, left);
                else right = Insert(n, right);
            }
            int lefti = 0, righti = right.Count - 1;
            while (lefti < left.Count && righti >= 0)
            {
                int s = left[lefti] + right[righti];
                if (s < target) lefti++;
                else if (s > target) righti--;
                else return true;
            }
            return false;
        }

        static List<int> Insert(int num, List<int> nums)
        {
            var result = new List<int>();
            int lefti = 0, left = nums[0]+num;
            for (var righti = 0; righti < nums.Count; righti++)
            {

                int right = nums[righti];
                while (left < right)
                {
                    result.Add(left);
                    left = nums[++lefti] + num;
                }
                if (right != left) result.Add(right);
            }
            while (lefti < nums.Count) result.Add(nums[lefti++] + num);
            return result;
        }
    }
}

And here is an improved version that prunes the sets:

static bool SubsetSum(List<int> nums, int target)
{
    var remainingSum = nums.Sum();
    var left = new List<int> { 0 };
    var right = new List<int> { 0 };
    foreach (var n in nums)
    {
        if (left.Count == 0 || right.Count == 0) return false;
        remainingSum -= n;
        if (left.Count < right.Count) left = Insert(n, left, target - remainingSum - right.Last(), target);
        else right = Insert(n, right, target - remainingSum - left.Last(), target);
    }
    int lefti = 0, righti = right.Count - 1;
    while (lefti < left.Count && righti >= 0)
    {
        int s = left[lefti] + right[righti];
        if (s < target) lefti++;
        else if (s > target) righti--;
        else return true;
    }
    return false;
}

static List<int> Insert(int num, List<int> nums, int min, int max)
{
    var result = new List<int>();
    int lefti = 0, left = nums[0]+num;
    for (var righti = 0; righti < nums.Count; righti++)
    {

        int right = nums[righti];
        while (left < right)
        {
            if (min <= left && left <= max) result.Add(left);
            left = nums[++lefti] + num;
        }
        if (right != left && min <= right && right <= max) result.Add(right);
    }
    while (lefti < nums.Count)
    {
        left = nums[lefti++] + num;
        if (min <= left && left <= max) result.Add(left);
    } 
    return result;
}

This last one solved the problem with 100000 ones in about 5 milliseconds (but this is a best case of the algorithm, with real world data it will be slower).

For your use this algorithm is probably fast enough (and I don't see any obvious improvements). If you enter 10,000 products with a random price between 0 and 20 and your goal is to sum to 500, this is solved in 0.04 seconds on my laptop.
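
If you want to reproduce that kind of timing, a sketch of the test (dropped into Main, reusing the improved SubsetSum above; the 0-to-20 price range is just the scenario described):

    var rng = new Random();
    var prices = new List<int>();
    for (int i = 0; i < 10000; i++)
        prices.Add(rng.Next(0, 21)); // random price between 0 and 20

    var sw = Stopwatch.StartNew();
    bool found = SubsetSum(prices, 500); // can we pick products totalling exactly 500?
    sw.Stop();
    Console.WriteLine("{0} in {1}", found, sw.Elapsed);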

Edit: I just read on Wikipedia that the best known algorithm is O(2^(n/2)*n). This one is O(2^(n/2)). Do I get the Turing Award?

Jules
Thanks, Nice effort :)
MadBoy
Why do you say it's `O(2^(n/2))`? You traverse your entire `nums` list in each call to `Insert`. I think it's pointless to do such tests really, because it's easy to find one that each algorithm (pseudopolynomial or exponential) will fail on. You found one that the pseudopolynomial algo fails on (takes a long time on): 100000. Here's one that your algo takes a long time on too: 10000 numbers: 1 2 3 4 ... 10000. Search for 345600. Also, you're only printing true or false, I think printing the numbers too will add some overhead as well. Anyway, this does seem faster than the DP so +1, however...
IVlad
However, if we're going to battle with such high numbers, let me implement my randomized algorithm when I get back from college :). I think that is better if we're dealing with very high numbers.
IVlad
I say it's O(2^(n/2)) because it essentially splits the input list in two (so n/2 elements), then computes two sorted subset sum lists for those in O(2^(n/2)), and then it finds whether the target can be made from these two lists, which is another O(2^(n/2)) operation. So in total it's O(2^(n/2)) =~ O(1.4^n). The previous algorithms were O(2^n).
Jules
Of course you can find an input where it will take a long time. The point is that the other algorithms will take even longer on smaller inputs ;) You're right that printing the numbers will add substantial overhead.
Jules
Oh, the previous algorithms are actually *faster* on the test case you mention. That's strange...
Jules