ansaurus

Question

Creating a formula for calculating device "health" based on uptime/reboots

Answer 1

+5 A:

You could do something like Windows 7's reliability metric: start out at full health (say 10). Every hour / day / checkin cycle, increment the health by (10 - currenthealth)*incrementfactor. Every time the server goes down, subtract a certain percentage of the current health.

So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:

A device that has rebooted a lot in the past but has not rebooted in the last 20 days will have a health of about 8.6
A device with long uptime, except for the last 2 days in which it has rebooted 5 times, will have a health of about 4.1
A device that has been up for 30 days and just rebooted will have a health of 8
A device that has rebooted every 24 hrs or so for the last 10 days will have a health of about 3.9

To run through an example:

Starting at 10
Day 1: no crash, new health = CurrentHealth + (10 - CurrentHealth)*.1 = 10
Day 2: One crash, new health = CurrentHealth - CurrentHealth*.2 = 8. The daily increment still applies, so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.2 + (10 - 8.2)*.1 = 8.38, or about 8.4
Day 4: Two crashes, new health = 8.38*.8*.8 = 5.36, then 5.36 + (10 - 5.36)*.1 = 5.8
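
The walkthrough above can be reproduced with a few lines of Python. This is only a sketch under the stated constants (maximum health 10, crashfactor 20%, incrementfactor 10%/day); the function names step and simulate are made up for illustration, and it assumes crashes are applied before the daily increment, as in the Day 2 line.

```python
MAX_HEALTH = 10.0
CRASH_FACTOR = 0.2      # fraction of current health lost per crash
INCREMENT_FACTOR = 0.1  # fraction of the gap to MAX_HEALTH regained per day


def step(health, crashes):
    """Apply one day's crashes, then the daily recovery increment."""
    for _ in range(crashes):
        health -= health * CRASH_FACTOR
    return health + (MAX_HEALTH - health) * INCREMENT_FACTOR


def simulate(crashes_per_day, start=MAX_HEALTH):
    """Return the health at the end of each day in crashes_per_day."""
    history = []
    health = start
    for crashes in crashes_per_day:
        health = step(health, crashes)
        history.append(health)
    return history


if __name__ == "__main__":
    # Days 1-4 from the example: no crash, one crash, no crash, two crashes
    for day, h in enumerate(simulate([0, 1, 0, 2]), start=1):
        print(f"Day {day}: {h:.1f}")
```

Changing the two factors and re-running over a device's real crash history is a quick way to tune them.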

Eclipse 2010-02-01 22:10:04

This is an interesting angle I hadn't thought of. I had forgotten that Win7 has a reliability metric. How would your example function ever get past zero, though? Start at 10; on the second checkin the increment would be (10 - 10 (current health)) * incrementfactor (which could be anything). That still leaves me at zero. Am I missing something?

Todd Brooks 2010-02-01 22:16:42

ok, I like your edits. Any way to get that into a formula? How would the crash factor be determined, or is that an arbitrarily created constant? How does that factor into your original formula?

Todd Brooks 2010-02-01 22:23:00

It means that if its health is 10, it won't get any bigger. And the lower the health, the more it gains for each period of uptime.

Samuel Carrijo 2010-02-01 22:24:22

I think with some slight modifications to the constant values this will work out perfectly. Many thanks!!

Todd Brooks 2010-02-01 22:26:40

You could come up with a formula (look up compound interest for examples), but it'd be easier just to iterate over the last xx days. The crashfactor and incrementfactor, as well as your maximum health value, would be arbitrarily chosen constants that you can tune to get the output you want.
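
For the crash-free stretches, the compound-interest analogy does give a closed form. The symbols below are introduced here for illustration and are not from the answer: M is the maximum health, r the incrementfactor, c the crashfactor, h_0 the health at the start of the stretch, and n the number of crash-free days.

```latex
% One crash-free day closes a fraction r of the gap to M:
%   h_{k+1} = h_k + r\,(M - h_k)  \iff  M - h_{k+1} = (1 - r)\,(M - h_k)
% so after n crash-free days:
h_n = M - (M - h_0)\,(1 - r)^n
% A single crash multiplies the current health by (1 - c):
h \leftarrow (1 - c)\,h
```

As a check against the example, M = 10, r = 0.1 and h_0 = 8 (just after one crash) give h_1 = 10 - 2(0.9) = 8.2. Once crashes are mixed in at arbitrary times, iterating day by day is simpler than stitching these pieces together.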

Eclipse 2010-02-01 22:27:32

Answer 2

A:

You might take the reboot count / t of a particular machine and compare that to the mean and standard deviation of the entire population. Machines that fall, say, three standard deviations from the mean on the side of rebooting more often could be flagged.
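
A minimal sketch of that comparison, assuming you already have a reboot rate (reboots per day, say) for every device. The function name and the dictionary layout are hypothetical; the three-sigma cutoff is the one suggested above.

```python
from statistics import mean, pstdev


def flag_frequent_rebooters(reboot_rates, threshold=3.0):
    """Return ids of devices whose reboot rate is more than `threshold`
    population standard deviations above the population mean.

    reboot_rates: dict mapping device id -> reboots per unit of time.
    """
    rates = list(reboot_rates.values())
    if len(rates) < 2:
        return []  # not enough devices to compare against
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        return []  # every device reboots at the same rate
    return [
        device
        for device, rate in reboot_rates.items()
        if (rate - mu) / sigma > threshold  # only the "too many reboots" side
    ]
```

Note that with a small fleet a single outlier inflates the standard deviation itself, so a three-sigma cutoff may never trigger; the threshold would need tuning for the population size.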

fatcat1111 2010-02-01 22:10:16

Answer 3

A:

You could use weighted average uptime and include the current uptime only when it would make the average higher.

The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
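
One way to read this, sketched in Python. The geometric decay used for the weights and the parameter names are assumptions made for the example; the answer only says that more recent uptimes should weigh more.

```python
def weighted_uptime(completed_uptimes, current_uptime, decay=0.8):
    """Recency-weighted average of uptime durations (any consistent unit).

    completed_uptimes: durations of finished uptime periods, oldest first.
    current_uptime: duration of the period that is still running.
    decay: hypothetical tuning constant; each step back in time
           multiplies the weight by this factor.
    """
    def average(uptimes):
        n = len(uptimes)
        # newest entry gets weight 1, the one before it `decay`, and so on
        weights = [decay ** (n - 1 - i) for i in range(n)]
        return sum(w * u for w, u in zip(weights, uptimes)) / sum(weights)

    if not completed_uptimes:
        return float(current_uptime)

    without_current = average(completed_uptimes)
    with_current = average(list(completed_uptimes) + [current_uptime])
    # count the ongoing period only when it raises the average
    return max(without_current, with_current)
```

Taking the maximum of the two averages means a device that has just rebooted is not punished for its short current uptime; the reboot shows up only once that short period is completed.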

svick 2010-02-01 22:17:47

Answer 4

A:

Does it always report a runtime of 0 on reboot? Or something close to zero (less than the previous value, anyway)?

You could calculate this in two ways. 1. The lower the number, the less trouble the device has had. 2. The higher the number, the longer its uptime periods have been.

I guess you need to account for the fact that health can vary, so it can worsen over time. The latest values should therefore have a higher weight than the older ones. This could call for an exponential weighting.

The more reboots it had in the last period, the more broken the system could be. But you also need to look at how closely spaced the reboots are. Say, 5 reboots in a day vs. 10 reboots in 2 weeks: those mean very different things. So I guess time should be a metric in this formula as well as the number of reboots.

I guess you need to calculate the density of reboots in the last period.

You can apply the weight by simply dividing: the larger the number you divide by, the lower the result, and so the lower the weight.

Pseudo code:

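function calcHealth(machine) {
    float value = 0;
    float threshold = 800;
    float secondsPerDay = 86400;

    for each (reboot in machine.reboots) {
        // age of this reboot in days, assuming time() and reboot.time are in seconds;
        // clamp to 1 so that a reboot from today does not divide by zero
        float daysPast = max(1, (time() - reboot.time) / secondsPerDay);

        // the more days past, the lower the value, so the lower the weight
        value += (100 / daysPast);
    }

    // no reboots recorded gives 0; callers need to treat that as a special case
    return (value == 0) ? 0 : (threshold / value);
}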

You could extend this function by, for example, filtering on a maxDaysPast and tuning the threshold.

This formula is based on the curve f(x) = 100/x. For a low x the value is high, and for a large x it is low. That is how the formula weights daysPast: lower daysPast == lower x == higher weight.

With the value += the formula counts the reboots, and the 100/x part gives each reboot a weight based on how long ago it happened.

At the return, the threshold is divided by the value, because the higher the reboot score, the lower the result must be.

You can use a plotting program or calculator to see the shape of the curve, which is also the shape of the weight given to daysPast.

Michiel 2010-02-01 22:27:38

Density of reboots would be useful if it could be weighted according to time period.

Todd Brooks 2010-02-01 22:35:32

Any way this formula could be bounded? For example, limit it between a MIN and MAX value.
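
Two possible ways to do that, sketched in Python. Neither comes from the answer above; the function names and the scale parameter are hypothetical. The first clamps any raw score, the second maps the summed reboot weight (the value in the pseudo code) smoothly into the range.

```python
def clamp_health(raw_score, min_health=0.0, max_health=10.0):
    """Cut a raw score off at the ends of [min_health, max_health]."""
    return max(min_health, min(max_health, raw_score))


def health_from_reboot_score(value, min_health=0.0, max_health=10.0, scale=100.0):
    """Map a non-negative reboot score to (min_health, max_health].

    value: summed reboot weight; 0 means no reboots.
    scale: hypothetical tuning constant; the score at which health
           sits halfway between min_health and max_health.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    return min_health + (max_health - min_health) / (1.0 + value / scale)
```

The second form also avoids the special case for a device with no reboots: a value of 0 simply yields max_health.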

Todd Brooks 2010-02-02 02:33:32

Answer 5

A:

Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.

Another suggestion is to look into the various moving average algorithms. These are meant to smooth out time-series data as well as highlight trends.
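
As one example, an exponential moving average over a series such as reboots per day could look like the sketch below. The choice of this particular moving average and the alpha default are assumptions made for illustration.

```python
def exponential_moving_average(values, alpha=0.3):
    """Smooth a series; alpha in (0, 1], higher alpha reacts faster."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    ema = None
    for v in values:
        # the first value seeds the average; later values are blended in
        ema = v if ema is None else alpha * v + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed
```

A rising smoothed reboot count points to a device that is getting worse, while a single bad day is damped rather than dominating the result.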

Bryan Batchelder 2010-02-01 22:38:58
