views: 309

answers: 3

I have a data set with attributes like this:

Marital_status = {M,S,W,D}
IsBlind = {Y,N}
IsDisabled = {Y,N}
IsVeteran = {Y,N}

etc. There are about 200 such variables.

I need an algorithm to generate combinations of the attributes, with one value per attribute in each combination.

In other words, my first combination would be:

Marital_status = M, IsBlind = Y, IsDisabled = Y, IsVeteran = Y

The next set would be:

Marital_status = M, IsBlind = Y, IsDisabled = Y, IsVeteran = N

I tried to use a simple combination generator, treating each value of each attribute as an attribute itself. It did not work: the mutually exclusive choices were included in the combinations, and the number of possible combinations was enormous (133873417996074857185490633899939406700260683726864088366400, to be precise).

Could you please suggest an algorithm (preferably coded in Java)?

Thanks!!

+5  A: 

Find another way. If you have 200 variables, and each one has at least 2 choices, you're going to have >= 2^200 combinations. If you generate one combination each nanosecond, it would take about 10^43 years to enumerate 2^200 choices.
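A quick back-of-the-envelope check in Java, using the one-combination-per-nanosecond rate assumed above (31,557,600 is the number of seconds in a Julian year):

    import java.math.BigInteger;

    public class EnumerationCost {
        public static void main(String[] args) {
            BigInteger combos = BigInteger.valueOf(2).pow(200);     // >= 2^200 combinations
            // 1e9 combinations per second, ~3.156e7 seconds per year
            BigInteger perYear = BigInteger.valueOf(1_000_000_000L)
                                           .multiply(BigInteger.valueOf(31_557_600L));
            System.out.println(combos.divide(perYear) + " years");  // on the order of 5 * 10^43
        }
    }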

Keith Randall
Yes, I am thinking of partitioning the data set into 10-20 chunks and testing one chunk at a time. I cannot visualize how good a test that would be.
srini.venigalla
+3  A: 

As Keith pointed out, the number of combinations will be impossibly large if there are no excluded combinations, which would make your requirement impossible to meet. However, since you've already said that you have mutually exclusive choices, the solution space will be smaller.

How much smaller? That depends on how many choices are mutually exclusive. I recommend doing some math on that before going much further.

Assuming that enough choices are exclusive, you're still essentially going to have to brute-force it, and you're very unlikely to find an existing, ready-made algorithm.

Which brings me to the question: what's your reason for doing this - exhaustive testing? It sounds good, but you may find that it's not possible. I've encountered this issue myself, and in the end you may well be forced to settle for a combination of carefully selected "edge" cases plus some quasi-randomly selected other cases.

Having read your comment above, it appears you define "mutual exclusion" differently than I do, and I fear that you may have a problem.

So a given patient is not both blind and not blind. Great. But that's not what I (and I suspect everyone else here) understood when you mentioned mutual exclusions.

By mutual exclusions, I mean constraints such as: if blind, cannot be non-disabled, or something like that.

Without a significant number of mutually exclusive inter-relationships between your attributes to limit their combinations, you will be unable to complete your exhaustive testing.

CPerkins
Please see my comment above for what I am doing. I have already tested a few edge cases, but thought there could be an automated way to let a PC run test cases for a few days or weeks.
srini.venigalla
So you are looking for exhaustive testing. Please see my recommendation about doing some more math on this - you really need to see how much your combination space is reduced by the mutually exclusive choices, because no matter how many chunks you break your data into, you can't possibly test all the combinations if mutual exclusion isn't a huge factor.
CPerkins
Testing is one objective, but deriving a training data set is the second. If I have a good training data set, I do not need the PMML rule document. I am looking into reducing the variables first and then grouping them into functions, perhaps.
srini.venigalla
+4  A: 

As others have pointed out (and you yourself noted), it is impossible to test this exhaustively.

I suggest you take a sampling approach and test with that. You have a strong theoretical background, so you will be able to find and understand the relevant material on the internet.


But let me give a small example. For now, I will ignore possible "clusters" of strongly related parameters.

  • Create "one-value" samples covering every possible value of all 200 parameters, i.e., one seed sample per (parameter, value) pair. This exhaustiveness ensures that no parameter value can be forgotten.

    These samples don't have to be created up front; the values can be produced in a loop.

  • To each one-value sample, you then add values for the remaining parameters. A simple approach would be to choose a number of times you want to test each one-value sample (say N = 100) and, for each one-value sample, generate the other values randomly N times (see the Java sketch after the next paragraph).

If there are 1000 possible values across all 200 parameters and N = 100, that gives us 100K tests.
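Here is a rough Java sketch of that basic idea. The attribute list is just the one from the question, and N, the fixed random seed, and printing to stdout are illustrative choices rather than part of the method itself:

    import java.util.*;

    public class OneValueSampler {

        /** An attribute with its set of mutually exclusive values. */
        static class Attribute {
            final String name;
            final String[] values;
            Attribute(String name, String... values) {
                this.name = name;
                this.values = values;
            }
        }

        public static void main(String[] args) {
            List<Attribute> attrs = Arrays.asList(
                new Attribute("Marital_status", "M", "S", "W", "D"),
                new Attribute("IsBlind", "Y", "N"),
                new Attribute("IsDisabled", "Y", "N"),
                new Attribute("IsVeteran", "Y", "N")
                // ... the remaining ~196 attributes
            );

            int n = 100;                  // completions per pinned (attribute, value) pair
            Random rnd = new Random(42);  // fixed seed, so the test set is repeatable

            // For every value of every attribute: pin that value, then fill in
            // all the other attributes at random, n times.
            for (int i = 0; i < attrs.size(); i++) {
                for (String pinned : attrs.get(i).values) {
                    for (int k = 0; k < n; k++) {
                        Map<String, String> sample = new LinkedHashMap<>();
                        for (int j = 0; j < attrs.size(); j++) {
                            Attribute a = attrs.get(j);
                            sample.put(a.name, j == i
                                ? pinned
                                : a.values[rnd.nextInt(a.values.length)]);
                        }
                        System.out.println(sample);  // or hand it to the test harness
                    }
                }
            }
        }
    }

With the question's four attributes this produces 10 values x N = 1,000 samples; with ~1000 values across 200 parameters, the same loops give the 100K tests above.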


You could elaborate on this basic idea in many ways:

  • If you want your tests to be repeatable, you could generate the data only once, store it, and reuse the same sets in all future test runs.
  • You could control your distribution so that each value gets selected a fair number of times.
  • In real life, the 200 parameters wouldn't all be unconnected. Many parameters would actually be related to others, in that the probability of finding their values together is not uniform. Instead of building the initial exhaustive set on only one parameter as above, I would run the exhaustive set on a whole cluster of connected parameters (a sketch follows).
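For that cluster variant, a minimal sketch: enumerate the cluster exhaustively with a mixed-radix counter (an "odometer"), and fill in the remaining parameters at random as in the previous sketch. The cluster membership here is purely illustrative:

    public class ClusterEnumerator {
        public static void main(String[] args) {
            // A small cluster of connected parameters, enumerated exhaustively.
            String[][] cluster = {
                {"M", "S", "W", "D"},  // Marital_status
                {"Y", "N"},            // IsBlind
                {"Y", "N"}             // IsDisabled
            };
            int[] idx = new int[cluster.length];  // one "digit" per parameter

            while (true) {
                // Emit the current cluster combination (4 * 2 * 2 = 16 in total);
                // the other ~197 parameters would be randomized here.
                StringBuilder combo = new StringBuilder();
                for (int i = 0; i < cluster.length; i++) {
                    if (i > 0) combo.append(',');
                    combo.append(cluster[i][idx[i]]);
                }
                System.out.println(combo);

                // Advance the odometer: bump the last digit, carry to the left.
                int pos = cluster.length - 1;
                while (pos >= 0 && ++idx[pos] == cluster[pos].length) {
                    idx[pos--] = 0;
                }
                if (pos < 0) break;  // carried past the first digit: done
            }
        }
    }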
KLE
Thanks, this will work for me. I am familiar with the Bayesian sampling principle, which matches your suggestion. I can use the same data sets as a training set as well. If possible, I will post the code somewhere when I am done. Thanks again.
srini.venigalla
@KLE - I will vote up as soon as I get 15 points!!
srini.venigalla