views: 309

answers: 3

I have a data set with attributes like this:

Marital_status = {M,S,W,D}
IsBlind = {Y,N}
IsDisabled = {Y,N}
IsVeteran = {Y,N}

etc. There are about 200 such variables.

I need an algorithm to generate combinations of the attributes, with one value per attribute in each combination.

In other words, my first combination would be:

Marital_status = M, IsBlind = Y, IsDisabled = Y, IsVeteran = Y

The next set would be:

Marital_status = M, IsBlind = Y, IsDisabled = Y, IsVeteran = N

I tried to use a simple combination generator, treating each value of each attribute as an attribute itself. It did not work: the mutually exclusive choices were included in the combinations, and the number of possible combinations was enormous (133873417996074857185490633899939406700260683726864088366400, to be precise).

Could you please suggest an algorithm (preferably coded in Java)?

Thanks!!

+5  A: 

Find another way. If you have 200 variables, and each one has at least 2 choices, you're going to have >= 2^200 combinations. If you generate one combination each nanosecond, it would take about 10^43 years to enumerate 2^200 choices.
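A quick back-of-the-envelope check in Java, using the one-combination-per-nanosecond rate assumed above (31,557,600 is the number of seconds in a Julian year):

    import java.math.BigInteger;

    public class EnumerationCost {
        public static void main(String[] args) {
            BigInteger combos = BigInteger.valueOf(2).pow(200);     // >= 2^200 combinations
            // 1e9 combinations per second, ~3.156e7 seconds per year
            BigInteger perYear = BigInteger.valueOf(1_000_000_000L)
                                           .multiply(BigInteger.valueOf(31_557_600L));
            System.out.println(combos.divide(perYear) + " years");  // on the order of 5 * 10^43
        }
    }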

Keith Randall
Yes, I am thinking of partitioning the data set into 10-20 chunks and testing one chunk at a time. I cannot visualize how good a test that would be.
srini.venigalla
+3  A: 

As Keith pointed out, the number of combinations will be impossibly large if there are no excluded combinations, which would make your requirement impossible to meet. However, since you've already said that you have mutually exclusive choices, the solution space will be smaller.

How much smaller? That depends on how many choices are mutually exclusive. I recommend doing some math on that before going much further.

Assuming that enough choices are exclusive, you're still essentially going to have to brute-force it, and you're very unlikely to find an existing, ready-made algorithm.

Which brings me to the question: what's your reason for doing this - exhaustive testing? It sounds good, but you may find that it's not possible. I've encountered this issue myself, and in the end you may well be forced to settle for a combination of carefully selected "edge" cases plus some quasi-randomly selected other cases.

Having read your comment above, it appears you define "mutual exclusion" differently than I do, and I fear that you may have a problem.

So a given patient is not both blind and not blind. Great. But that's not what I (and I suspect everyone else here) understood when you mentioned mutual exclusions.

By mutual exclusions, I mean constraints such as: if blind, cannot be non-disabled, or something like that.

Without a significant number of mutually exclusive inter-relationships between your attributes to limit their combinations, you will be unable to complete your exhaustive testing.

CPerkins
Please see my comment above for what I am doing. I have already tested a few edge cases, but thought there could be an automated way to let a PC run test cases for a few days or weeks.
srini.venigalla
So you are looking for exhaustive testing. Please see my recommendation about doing some more math on this - you really need to see how much your combination space is reduced by the mutually exclusive choices, because no matter how many chunks you break your data into, you can't possibly test all the combinations if mutual exclusion isn't a huge factor.
CPerkins
Testing is one objective, but deriving a training data set is the second. If I have a good training data set, I do not need the PMML rule document. I am looking into reducing the variables first and then grouping them into functions, perhaps.
srini.venigalla
+4  A: 

As others have pointed out (and you yourself noted), it is impossible to test this exhaustively.

I suggest you take a sampling approach and test with that. You have a strong theoretical background, so you will be able to find and understand the relevant material on the internet.


But let me give a small example. For now, I will ignore possible "clusters" of strongly related parameters.

  • Create "one-value" samples covering every possible value of all 200 parameters, i.e., one seed sample per (parameter, value) pair. This exhaustiveness ensures that no parameter value can be forgotten.

    These samples don't have to be created up front; the values can be produced in a loop.

  • To each one-value sample, you then add values for the remaining parameters. A simple approach would be to choose a number of times you want to test each one-value sample (say N = 100) and, for each one-value sample, generate the other values randomly N times (see the Java sketch after the next paragraph).

If there are 1000 possible values across all 200 parameters and N = 100, that gives us 100K tests.
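Here is a rough Java sketch of that basic idea. The attribute list is just the one from the question, and N, the fixed random seed, and printing to stdout are illustrative choices rather than part of the method itself:

    import java.util.*;

    public class OneValueSampler {

        /** An attribute with its set of mutually exclusive values. */
        static class Attribute {
            final String name;
            final String[] values;
            Attribute(String name, String... values) {
                this.name = name;
                this.values = values;
            }
        }

        public static void main(String[] args) {
            List<Attribute> attrs = Arrays.asList(
                new Attribute("Marital_status", "M", "S", "W", "D"),
                new Attribute("IsBlind", "Y", "N"),
                new Attribute("IsDisabled", "Y", "N"),
                new Attribute("IsVeteran", "Y", "N")
                // ... the remaining ~196 attributes
            );

            int n = 100;                  // completions per pinned (attribute, value) pair
            Random rnd = new Random(42);  // fixed seed, so the test set is repeatable

            // For every value of every attribute: pin that value, then fill in
            // all the other attributes at random, n times.
            for (int i = 0; i < attrs.size(); i++) {
                for (String pinned : attrs.get(i).values) {
                    for (int k = 0; k < n; k++) {
                        Map<String, String> sample = new LinkedHashMap<>();
                        for (int j = 0; j < attrs.size(); j++) {
                            Attribute a = attrs.get(j);
                            sample.put(a.name, j == i
                                ? pinned
                                : a.values[rnd.nextInt(a.values.length)]);
                        }
                        System.out.println(sample);  // or hand it to the test harness
                    }
                }
            }
        }
    }

With the question's four attributes this produces 10 values x N = 1,000 samples; with ~1000 values across 200 parameters, the same loops give the 100K tests above.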


You could elaborate on this basic idea in many ways:

  • If you want your tests to be repeatable, you could generate the data only once, store it, and reuse the same sets in all future test runs.
  • You could control your distribution so that each value gets selected a fair number of times.
  • In real life, the 200 parameters wouldn't all be unconnected. Many parameters would actually be related to others, in that the probability of finding their values together is not uniform. Instead of building the initial exhaustive set on only one parameter as above, I would run the exhaustive set on a whole cluster of connected parameters (a sketch follows).
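For that cluster variant, a minimal sketch: enumerate the cluster exhaustively with a mixed-radix counter (an "odometer"), and fill in the remaining parameters at random as in the previous sketch. The cluster membership here is purely illustrative:

    public class ClusterEnumerator {
        public static void main(String[] args) {
            // A small cluster of connected parameters, enumerated exhaustively.
            String[][] cluster = {
                {"M", "S", "W", "D"},  // Marital_status
                {"Y", "N"},            // IsBlind
                {"Y", "N"}             // IsDisabled
            };
            int[] idx = new int[cluster.length];  // one "digit" per parameter

            while (true) {
                // Emit the current cluster combination (4 * 2 * 2 = 16 in total);
                // the other ~197 parameters would be randomized here.
                StringBuilder combo = new StringBuilder();
                for (int i = 0; i < cluster.length; i++) {
                    if (i > 0) combo.append(',');
                    combo.append(cluster[i][idx[i]]);
                }
                System.out.println(combo);

                // Advance the odometer: bump the last digit, carry to the left.
                int pos = cluster.length - 1;
                while (pos >= 0 && ++idx[pos] == cluster[pos].length) {
                    idx[pos--] = 0;
                }
                if (pos < 0) break;  // carried past the first digit: done
            }
        }
    }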
KLE
Thanks, this will work for me. I am familiar with the Bayesian sampling principle, which matches your suggestion. I can use the same data sets as a training set as well. If possible, I will post the code somewhere when I am done. Thanks again.
srini.venigalla
@KLE - I will vote up as soon as I get 15 points!!
srini.venigalla