views:

632

answers:

5

Here's a problem that I seem to be running into working with an accounting system.

I have a set of transactions, but their sum does not equal the amount that the accounting department thinks that it should. They are not questioning the math, just the transactions being included :p

Is there an algorithm that would help me determine which transactions in the set should not be included in order for the sum to match a given amount.

Given Set:  
2  
4  
5  
7

Given Sum Amount:
13

Result Set:
2
4
7

Edit: There's less than 100 transactions in the set. Does anyone have a C# example as there is not one on the Solving the NP-complete problem in XKCD question?

Man, I should have gotten a CS degree.

+6  A: 

This is the Knapsack Problem and it's NP-Complete. You won't easily solve it exactly with anything except small input sets. For any decent-sized problem set, it's one of those lifetime-of-the-universe-to-solve problems.

That said, there are genetic-algorithm knapsack solvers out there.

Aric TenEyck
+3  A: 

This is a version of the knapsack problem. It's NP complete, so you're not going to get a good general answer. How big are your sets of transactions? Is it 5 like you showed, or is it more like 500?

Tyler McHenry
+6  A: 

This is the Subset Sum problem, which is NP-Complete. But that doesn't mean there isn't an algorithm for finding a subset sum.

Sev
+1 for providing a more precise answer than everyone else. The Subset Sum problem is a special case of the knapsack problem.
Brian
But this *is* the Knapsack Problem. He's not just asking whether or not the subset exists, but *exactly* what that subset is.
Aric TenEyck
My understanding is that Subset Sum is an exact match, while more generalized Knapsack is finding a maximum amount.
Even Mien
+4  A: 

As the above members have said, this is the Subset Sum problem (or knapsack problem). However, to say it cannot be done efficiently is not very precise. It can be done, just not in polynomial time. THe solution is actually quite simple using dynamic programming and recursion (and in pseudo-polynomial time).

Given integers [a_1, ... ,a_n] and a number T,

Define the array S[i,k] to denote if there is a subset of elements between a_1, ... a_i which add up to k. (This is a binary matrix).

We can then define a recursive relationship as follows:

S[i,k] = S[i-1,k] or S[i-1,k-a_j]

In words, this means we either use element a_i or we do not. The answer will be located at S[n,T].

What is the work load to construct matrix S? Well, S has n*T elements. To compute each element, we must do O(1) work. So the complete running time is O(n*T).

Now at this point, it seems that I have proven P=NP, as this seems to be a polynomial time algorithm. However, remember that we measure input size in binary, so T = 2^p for some p.

I don't think anyone would say that the above solution, when implemented properly is unreasonable. IN fact, for many reasonable problem sizes, it will perform admirably.

Also, there are some heuristics for solving this problem, but I am not an expert in that area.

SplittingField
+2  A: 

OK. Lots of people have given the name of the problem and mentioned how NP hard it is . And in general, they are correct. However, you have a very specific case you need to solve. The first question to ask is whether or not you think your 100 transactions are close to being the right ones. You have some total ("your" total). They have some total. ("real" total). Some of your transactions are therefore bogus. If you suspect that there are only a few bogus transactions in there, then this isn't so bad. For example, let's consider the case where there is only one bogus transaction. In that case, we'd only have to check 100 different numbers. If there are 2 bogus trans, then you're looking at 100*99 checks, and things will get crazy at 4 bogus trans, with almost 100,000,000 steps. though if all you're doing is adding some int's that's not too terrible.

Another possible shortcut is to examine the nature of your data (incidentally, is it possible to post the 100 "numbers" and the expected sum?). How much is your sum over the real sum? Are there any values so big that eliminating them would make your sum suddenly lower than the real sum? If so, you know those values cannot be the bogus ones. For example, in your example, 7 is absolutely required.

Peter Recore
Thanks. I was thinking that about how to reduce the subset on the commute home yesterday. For example, the set to work with can be reduced down to a set of transactions of amounts less than the difference between the original set total and expected total. Then find the transactions in the reduced set whose sum matches that difference.
Even Mien