I am trying to build a mathematical model of the availability of a file in a distributed file system. I posted this question on MathOverflow, but it might as well be classified as a CS question, so I am giving it a shot here as well.

The system works like this: a node stores a file (encoded using erasure codes) at r*b remote nodes, where r is the replication factor and b is an integer constant. Erasure-coded files have the property that the file can be restored iff at least b of the remote nodes are available and return their parts of the file.

The simplest approach is to assume that all remote nodes are independent of each other and have the same availability p. With these assumptions, the availability of a file follows a binomial distribution:

$\text{avail} = \sum_{i=b}^{r \cdot b} \binom{r \cdot b}{i} \, p^i (1-p)^{r \cdot b - i}$
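For concreteness, this sum is cheap to evaluate directly; here is a minimal Python sketch (the function name is my own, and math.comb requires Python 3.8+):

import math

def binomial_availability(p, r, b):
    """P(at least b of the r*b independent nodes with availability p are up)."""
    n = r * b
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(b, n + 1))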

Unfortunately these two assumptions can introduce a non-negligible error, as shown by this paper: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf.

One way to overcome the assumption that all nodes have the same availability is to calculate the probability of each possible combination of available/non-available nodes and take the sum over all these outcomes (which is sort of what they suggest in the paper above, just more formally than what I just described). You can see this approach as a binary tree of depth r*b where each leaf is one possible combination of available/not-available nodes. The file's availability is the probability that you arrive at a leaf with >=b available nodes. This approach is more correct, but has a computational cost of O(2^(r*b)). Also, it doesn't deal with the assumption of node independence.
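For reference, a direct implementation of that exponential enumeration could look like this sketch (names are hypothetical; p is a list of per-node availabilities):

from itertools import product

def exact_availability(p, b):
    """Sum the probabilities of all up/down combinations with >= b nodes up.
    p is a list of per-node availabilities; the cost is O(2^len(p))."""
    total = 0.0
    for states in product((True, False), repeat=len(p)):
        if sum(states) >= b:
            prob = 1.0
            for alive, pi in zip(states, p):
                prob *= pi if alive else (1.0 - pi)
            total += prob
    return total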

Do you guys have any ideas for a good approximation which introduces less error than the binomial-distribution approximation, but with a better computational cost than http://bit.ly/d52MM9?

You can assume that the availability data of each node is a set of tuples of the form (measurement date, measuring node, measured node, success/failure bit). With this data you could, for example, calculate the correlation of the availability between the nodes, as well as the availability variance.

+4  A: 

For large r and b you can use a method called Monte Carlo integration (see e.g. Monte Carlo integration on Wikipedia, and/or chapter 3.1.2 of SICP) to compute the sum. For small r and b, and for significantly different node-failure probabilities p[i], the exact method is superior. The exact definitions of small and large will depend on a couple of factors and are best tried out experimentally.

Specific sample code: Here is some very basic sample code (in Python) to demonstrate how such a procedure could work:

import random

def montecarlo(p, rb, b, N):
    """Monte-Carlo estimate of the probability that at least b out of
    rb nodes are alive, each independently with availability p."""
    succ = 0

    # Run N samples
    for i in range(N):
        # Generate a single test case: count how many nodes are alive
        alivenum = 0
        for j in range(rb):
            if random.random() < p:
                alivenum += 1
        # If at least b nodes are alive, the test case succeeds
        if alivenum >= b:
            succ += 1
    # The final result is the number of successful cases / number of total cases
    # (i.e., a probability between 0 and 1)
    return float(succ) / N

The function corresponds to the binomial test case and runs N tests, checking whether at least b out of r*b nodes are alive, where each node is alive with probability p. A few experiments will convince you that you need values of N in the range of thousands of samples before you can get any reasonable results, but in principle the complexity is O(N*r*b). The statistical error of the result scales as 1/sqrt(N), i.e., to increase the accuracy by a factor of two you need to increase N by a factor of four. For sufficiently large r*b this method will be clearly superior.
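As a quick usage check, the estimate should stabilize as N grows (the parameter values below are arbitrary examples):

# Arbitrary example: r*b = 30 fragments, b = 10 needed, availability p = 0.9
for N in (1000, 10000, 100000):
    print(N, montecarlo(0.9, 30, 10, N))  # fluctuations shrink like 1/sqrt(N)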

Extension of the approximation: You obviously need to design the test case such that it respects all the properties of the system. You have suggested a couple of extensions, some of which can be easily implemented while others cannot. Let me give you a couple of suggestions:

1) In the case of distinct but uncorrelated p[i], the changes to the above code are minimal: in the function head you pass an array instead of a single float p, and you replace the line if random.random()<p: alivenum += 1 by

if random.random()<p[j]: alivenum += 1
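Put together, the modified function might look like this (a sketch; p is now a list of length r*b, and the function name is my own):

import random

def montecarlo_distinct(p, b, N):
    """Monte-Carlo estimate with a distinct availability p[j] for each node."""
    succ = 0
    for i in range(N):
        alivenum = sum(1 for j in range(len(p)) if random.random() < p[j])
        if alivenum >= b:
            succ += 1
    return float(succ) / N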

2) In the case of correlated p[i] you need additional information about the system. The situation I was referring to in my comment could be a network like this:

A--B--C
   |  |
   D  E
   |
   F--G--H
      |
      J

In this case A might be the "root node", and a failure of node D would imply the automatic failure, with 100% probability, of nodes F, G, H and J, while a failure of node F would automatically bring down G, H and J, etc. At least this was the case I was referring to in my comment (which is a plausible interpretation, since you talk about a tree structure of probabilities in the original question). In such a situation you would need to modify the code so that p refers to a tree structure and for j in ... traverses the tree, skipping the lower branches from the current node as soon as a test fails. The resulting test is still whether alivenum >= b as before, of course.
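A sketch of such a tree-aware sampler (the tree representation, the dictionaries and the function names here are my own assumptions, not part of the original system):

import random

def sample_alive(tree, p, root):
    """Sample the set of alive nodes, where a dead node takes down its whole
    subtree. tree maps each node to its children; p maps nodes to availabilities."""
    alive = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if random.random() < p[node]:          # the node itself is up...
            alive.add(node)
            stack.extend(tree.get(node, []))   # ...only then can its children be up
    return alive

def montecarlo_tree(tree, p, root, b, N):
    return sum(len(sample_alive(tree, p, root)) >= b for _ in range(N)) / N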

3) This approach will fail if the network is a cyclic graph that cannot be represented by a tree structure. In such a case you first need to generate graph nodes that are either dead or alive, and then run a routing algorithm on the graph to count the number of unique reachable nodes. This won't increase the time complexity of the algorithm, but it obviously increases the code complexity.
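One sample step could then look like this sketch (an adjacency-list representation and a BFS are assumed; names are hypothetical):

import random
from collections import deque

def reachable_alive(adj, p, source):
    """Mark each node alive with probability p[node], then BFS from source
    through alive nodes and return the number of alive nodes reached."""
    alive = {node for node in adj if random.random() < p[node]}
    if source not in alive:
        return 0
    seen = {source}
    queue = deque([source])
    while queue:
        for nbr in adj[queue.popleft()]:
            if nbr in alive and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen)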

4) Time dependence is a non-trivial but possible modification if you know the m.t.b.f./m.t.t.r. (mean time between failures / mean time to repair), since these can give you the probabilities p, of either the tree structure or the uncorrelated linear p[i], as a sum of exponentials. You will then have to run the MC procedure at different times t with the corresponding values of p(t).
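For a single node, the standard two-state Markov (up/down) model gives exactly such a form; a sketch, assuming constant failure and repair rates and a node that starts out alive:

import math

def availability(t, mtbf, mttr):
    """P(node is up at time t) in a two-state Markov model with failure
    rate lam = 1/mtbf and repair rate mu = 1/mttr, starting up at t = 0."""
    lam, mu = 1.0 / mtbf, 1.0 / mttr
    return mu / (lam + mu) + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)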

5) If you merely have the log files (as hinted at in your last paragraph), it will require a substantial modification of the approach, which is beyond what I can do on this board. The log files would need to be sufficiently thorough to allow you to reconstruct a model of the network graph (and thus the graph of p), as well as the individual values of p for all nodes. Otherwise, accuracy would be unreliable. These log files would also need to be substantially longer than the time scales of failures and repairs, an assumption which may not be realistic in real-life networks.

user8472
Thanks for your great response! I accepted your answer before the bounty expired, but for some reason it says that the bounty was auto-accepted and only half the points were awarded. Sorry about that.
Yrlec
+1  A: 

Assuming each node has a constant, known and independent availability rate, a divide-and-conquer approach comes to mind.

Say you have N nodes.

  1. Split them into two sets of N/2 nodes.
  2. For each side, compute the probability that any number of nodes ([0,N/2]) are down.
  3. Multiply and sum these as needed to find the probability that any number ([0,N]) of the full set is down.

Step 2 can be done recursively, and at the top level you can sum as needed to find how often more than some number are down.

I don't know the complexity of this, but if I had to guess, I'd say at or below O(n^2 log n).


The mechanics of this can be illustrated on a terminal case. Say we have 5 nodes with uptime probabilities p1...p5. We can split this into segment A with p1...p2 and segment B with p3...p5. We can then process these to find the "n nodes up" probabilities for each segment:

For A:

a_2 = p1*p2

a_1 = p1*(1-p2) + (1-p1)*p2

a_0 = (1-p1)*(1-p2)

For B:

b_3

b_2

The final results for this stage can be found by multiplying each of the a's with each of the b's and summing appropriately.

v[0] = a[0]*b[0]
v[1] = a[1]*b[0] + a[0]*b[1]
v[2] = a[2]*b[0] + a[1]*b[1] + a[0]*b[2]
v[3] =             a[2]*b[1] + a[1]*b[2] + a[0]*b[3]
v[4] =                         a[2]*b[2] + a[1]*b[3]
v[5] =                                     a[2]*b[3]
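A minimal sketch of the full divide-and-conquer computation (the function name is mine): given a list of per-node uptime probabilities, it returns the distribution over how many nodes are up.

def updist(p):
    """Return v where v[k] = P(exactly k of the nodes in p are up)."""
    if len(p) == 1:
        return [1.0 - p[0], p[0]]           # [P(0 up), P(1 up)]
    a = updist(p[:len(p) // 2])             # left half
    b = updist(p[len(p) // 2:])             # right half
    v = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):              # convolve the two distributions
        for j, bj in enumerate(b):
            v[i + j] += ai * bj
    return v

# Availability of the file is then P(at least b nodes up) = sum(updist(p)[b:])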
BCS
1) How can you generalize this approach to the case of distinct failure probabilities? Example for `N=20`: if the probabilities that the three nodes N2, N3 and N7 are down are different from those of N1, N4 and N5, you still have complexity `O(2^N)`, since you need to take into account all those distinct cases. 2) How can you generalize this approach to the case of node correlations? I.e., if a failure of node No. 2 results in a failure of nodes [N/2-1,...,N]? Such a non-locality cannot be handled efficiently in a recursive algorithm.
user8472
Your example case for A contains four terms, corresponding to a combination of four different cases leading to three possible outcomes. Complexity is thus `2^2=4`. Case B corresponds to four possible outcomes; had you explicitly written it down you would have b0, b1, b2, b3 with altogether `2^3=8` individual terms, each representing one of those eight cases. "Multiplying each of the `a`'s with each of the `b`'s and summing appropriately" generates six possible outcomes with altogether `2^5=32` terms. Thus, the complexity of your proposal is identical to the one in the original question.
user8472
@user8472: Multiplying each of the `a`'s with each of the `b`'s and summing appropriately generates six possible outcomes with altogether `4*3=12` terms: `a0*b0 ... a0*b3, a1*b0 ... a1*b3, a2*b0 ... a2*b3`, for a much reduced complexity, and the improvement gets better higher up.
BCS
@BCS: I see how your approach performs a memoization of several terms. `a1` was two terms and `b1` three, so `v1` would actually be five "terms", not two. However, since you use those same combinations several times for `v1`...`v4` you can get away with computing them only once. Nonetheless, I don't see how this approach could generalize to correlations since (see e.g. the network in my suggestion 3 for nodes A-E) in that case a failure of node B would non-locally cut off all contributions from `b0`...`b3` and the assumption of independence underlying your memoization would be incorrect.
user8472
@user8472: I thought my description was rather explicit in that it reused terms, oh well. --- It only works for independent, i.e. uncorrelated, systems. OTOH, for some systems the splitting could be selected to match the correlations. Even if that can only be done a little bit, it could reduce it to an `O(b*2^(n/b))` problem.
BCS
@BCS: Yes, I agree. In some cases the complexity of correlated systems can be reduced by a polynomial factor (i.e., it might still be exponential, but with a favorable parameter). However, I don't yet see how the suggestion in your last comment gives rise to a specific algorithm to achieve this. The advantage of your suggestion is that the solution is exact. On the other hand, the question poster has allowed an approximate solution if it is practical. For large systems (`r*b` in the 1000s or more) I don't (yet) see how an exact algorithm is practical.
user8472
@user8472: With a little manual setup it might be possible to apply it. However, I also don't see a way to automate it.
BCS