Setup:

I have a boolean matrix, e.g.:

1  0
1  1
0  1

where rows are named m1, m2, m3 (as methods) and columns t1 and t2 (as tests).

Definition:

An explanation is a set of rows (methods) whose union has at least one "1" in every column (every test has to be matched by at least one method).

In our example, the set of explanations would be:

{
  {m2}, 
  {m1,m2},{m1,m3},{m2,m3},
  {m1,m2,m3}
}
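
To make the definition concrete, here is a minimal brute-force sketch in Python (the dictionary encoding of the matrix is my own illustration, not part of the original setup); it enumerates exactly the five explanations listed above:

    from itertools import combinations

    matrix = {"m1": [1, 0], "m2": [1, 1], "m3": [0, 1]}   # rows m1..m3, columns t1, t2
    tests = range(2)

    def is_explanation(rows):
        # every test (column) must have a 1 in at least one chosen row
        return all(any(matrix[m][t] for m in rows) for t in tests)

    explanations = [set(c)
                    for size in range(1, len(matrix) + 1)
                    for c in combinations(matrix, size)
                    if is_explanation(c)]

    print(explanations)   # the five explanations from the example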

Problem:

I now want to compute all explanations.

I already have two implementations that solve the problem, one searching for explanations top-down, the other bottom-up, but both suffer from exponential growth in computing time (it doubles when the number of rows increases by one).

Is this a known (maybe solved efficiently) mathematical problem?

What could make things easier: in the end I only need, for every method, the number of explanations it occurs in. In our example that would be three occurrences for m1, four for m2 and three for m3.

My current algorithms work fine up to roughly 26 rows, but beyond that they become very slow.

Thank you for your help!

A: 

I do not know whether the number of possible explanations can be exponential (if it can, you cannot enumerate them faster than in exponential time).

However, you could approach this in a dynamic-programming style in order to eliminate duplicate effort (a Python sketch of this search follows the outline):

  • The first level is a list with a single element: the set of all methods.
  • Loop for each level:
    • Loop for each set in this level:
      • If this set is an explanation (ORing the test rows of its methods together gives all 1s):
        • put it into the result list
        • create all subsets of this set that have exactly one method less and put them into the next level
    • Remove all duplicates from the next level.
  • Repeat until the next level is empty.
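
A minimal sketch of this level-by-level search in Python (the frozenset representation is my choice; it relies on the fact that every superset of an explanation is again an explanation, so nothing is lost by only expanding explanations):

    def all_explanations(matrix, tests):
        def is_explanation(rows):
            return all(any(matrix[m][t] for m in rows) for t in tests)

        result = []
        level = {frozenset(matrix)}            # first level: the set of all methods
        while level:
            next_level = set()                 # a set, so duplicates collapse automatically
            for s in level:
                if is_explanation(s):
                    result.append(s)
                    # all subsets with exactly one method removed
                    next_level.update(s - {m} for m in s)
            level = next_level
        return result

    matrix = {"m1": [1, 0], "m2": [1, 1], "m3": [0, 1]}
    print(all_explanations(matrix, range(2)))  # the 5 explanations of the example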
Svante
+1  A: 

If you can settle for approximate probabilities and want something scalable, Gibbs sampling might work. The basic idea is pretty simple: start with the all-rows explanation and repeat the following to sample a bunch of explanations.

  1. Choose a random row.
  2. Flip a coin.
  3. If the coin came up heads, add the row to the explanation (do nothing if it's already there).
  4. If the coin came up tails, attempt to remove the row from the explanation. If the result is not an explanation, put the row back.

In the limit, the fraction of samples containing a given row converges to the fraction of explanations that contain it. There are some practical implementations under the keywords "Bayesian inference using Gibbs sampling" (you have a uniform prior and observe that, for each column, the disjunction of the rows incident with it is true). Since I'm not an expert in this stuff, though, I can't advise you on the hazards of rolling your own.
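
As an illustration, here is a rough sketch of such a sampler in Python (the sample and burn-in counts are arbitrary choices of mine, not tuned values); it estimates, for each row, the fraction of explanations containing it:

    import random

    def gibbs_fractions(matrix, tests, samples=100000, burn_in=10000):
        rows = list(matrix)

        def is_explanation(state):
            return all(any(matrix[m][t] for m in state) for t in tests)

        state = set(rows)                      # start with the all-rows explanation
        counts = {m: 0 for m in rows}
        for step in range(burn_in + samples):
            m = random.choice(rows)            # 1. choose a random row
            if random.random() < 0.5:          # 2./3. heads: make sure the row is in
                state.add(m)
            elif m in state:                   # 4. tails: try to drop it ...
                state.discard(m)
                if not is_explanation(state):  # ... and put it back if coverage breaks
                    state.add(m)
            if step >= burn_in:
                for r in state:
                    counts[r] += 1
        return {m: counts[m] / samples for m in rows}

    matrix = {"m1": [1, 0], "m2": [1, 1], "m3": [0, 1]}
    print(gibbs_fractions(matrix, range(2)))   # roughly m1: 0.6, m2: 0.8, m3: 0.6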

+1  A: 

I think this is likely to be an exponential problem. For example, if one of the methods has a one in every column, then any subset of methods containing that method is an explanation, so if there are M methods there are at least 2^(M-1) explanations; similarly, if some pair of methods together has a one in every column, then there are at least 2^(M-2) explanations.

Here is a method that, while still exponential, I think is faster than enumerating all the explanations, particularly when there are methods with many 1s.

Let T(A,B) be the number of subsets of A (a set of methods) which have at least one 1 in each column in B (a set of columns).

If B is empty, T(A,B) is the number of subsets of A, i.e. 2^#A, where #A is the number of elements of A. Otherwise, if A is empty, T(A,B) is 0. Otherwise, if i is an element of A (e.g. the first one),

T(A, B) = T(A \ {i}, B \ m[i]) + T(A \ {i}, B)

(here A \ {i} is A without i, and B \ m[i] is B without any of the columns covered by method i; the first term counts the subsets that contain i, the second those that do not)

T can be coded quite succinctly as a recursive function.

Finally c[j], the number of times method j occurs in an explanation, is

c[j] = T(A \ {j}, C \ m[j])

where C is the set of all columns.
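
A short sketch of this recursion in Python (the frozenset encoding and the memoisation via lru_cache are additions of mine; the answer only calls for a recursive function). It reproduces the counts 3, 4, 3 for the example:

    from functools import lru_cache

    m = {"m1": frozenset({"t1"}),              # m[i]: the columns in which method i has a 1
         "m2": frozenset({"t1", "t2"}),
         "m3": frozenset({"t2"})}
    C = frozenset({"t1", "t2"})                # the set of all columns

    @lru_cache(maxsize=None)
    def T(A, B):
        # number of subsets of A with at least one 1 in every column of B
        if not B:
            return 2 ** len(A)                 # every subset of A qualifies
        if not A:
            return 0
        i = next(iter(A))                      # pick some method i in A
        # subsets containing i              +  subsets not containing i
        return T(A - {i}, B - m[i]) + T(A - {i}, B)

    A = frozenset(m)
    print({j: T(A - {j}, C - m[j]) for j in m})   # {'m1': 3, 'm2': 4, 'm3': 3}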

dmuir