I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock that counts seconds and reports the elapsed seconds at every check-in. A sample data set looks like this:

CheckinTime               Runtime
2010-01-01 02:15:00.000   101500
2010-01-01 02:25:00.000   102100
2010-01-01 02:35:00.000   102700

etc.

If the device reboots, when it checks back into the server, it reports a runtime of 0.

What I'm trying to determine is some sort of quantifiable metric for the device's "health".

If a device has rebooted a lot in the past but has not rebooted in the last xx days, it should be considered healthy compared to a device that has had long uptimes except for the last xx days, during which it has rebooted repeatedly. Also, a device that has been up for 30 days and just rebooted shouldn't be considered "distressed" compared to a device that has rebooted every 24 hrs or so for the last xx days.

I've tried multiple ways of calculating the health, using a variety of metrics:

1. average # of reboots
2. max(uptime)
3. avg(uptime)
4. # of reboots in last 24 hrs
5. # of reboots in last 3 days
6. # of reboots in last 7 days
7. # of reboots in last 30 days

Each individual metric only accounts for one aspect of the device health, but doesn't take into account the overall health compared to other devices or to its current state of health.

Any ideas would be GREATLY appreciated.

+5  A: 

You could do something like Windows 7's reliability metric: start out at full health (say 10). Every hour / day / check-in cycle, increment the health by (10 - currenthealth) * incrementfactor. Every time the device goes down, subtract a certain percentage of its current health.

So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:

  • A device that rebooted a lot in the past but has not rebooted in the last 20 days will have a health of 8.6

  • A device with big uptime except for the last 2 days, during which it rebooted 5 times, will have a health of 4.1

  • A device that has been up for 30 days and just rebooted will have a health of 8

  • A device that has rebooted every 24 hrs or so for the last 10 days will have a health of 3.9

To run through an example:

Starting at 10
Day 1: No crash, new health = currenthealth + (10 - currenthealth)*.1 = 10
Day 2: One crash, new health = currenthealth - currenthealth*.2 = 8. The daily increment still applies, so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.2 + (10 - 8.2)*.1 = 8.38, or about 8.4
Day 4: Two crashes, new health = 8.38*.8*.8 = 5.36, then 5.36 + (10 - 5.36)*.1 = about 5.8
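
The walk-through above can be sketched in Python roughly as follows. This is only an illustration: the function names `update_health` and `simulate` are made up for this sketch, and it assumes each day's crashes are applied first and the daily increment afterwards, as in the Day 2 step.

```python
def update_health(health, crashes, max_health=10.0,
                  crash_factor=0.2, increment_factor=0.1):
    """Apply one day's crashes, then the daily recovery step (assumed order)."""
    for _ in range(crashes):
        # Each crash removes crash_factor (20%) of the current health.
        health -= health * crash_factor
    # Recover increment_factor (10%) of the gap to full health.
    health += (max_health - health) * increment_factor
    return health


def simulate(crashes_per_day, start=10.0):
    """Return the health after each day for a list of daily crash counts."""
    history = []
    health = start
    for crashes in crashes_per_day:
        health = update_health(health, crashes)
        history.append(health)
    return history


# The four days from the example: no crash, one crash, no crash, two crashes.
print([round(h, 1) for h in simulate([0, 1, 0, 2])])  # [10.0, 8.2, 8.4, 5.8]
```

The maximum health, crash factor and increment factor are plain parameters here, so they can be tuned until the scores rank your devices the way you expect.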

Eclipse
This is an interesting angle I hadn't thought of. I had forgotten that Win7 has a reliability metric. How would your example function ever get past zero, though? Starting at 10, the increment at the second check-in would be (10 - 10 (current health)) * incrementfactor (which could be anything). That still leaves me at zero. Am I missing something?
Todd Brooks
ok, I like your edits. Any way to get that into a formula? How would the crash factor be determined, or is that an arbitrarily created constant? How does that factor into your original formula?
Todd Brooks
It means that if its health is 10, it won't get any bigger. And the lower the health, the more it gains for each period of uptime.
Samuel Carrijo
I think with some slight modifications to the constant values this will work out perfectly. Many thanks!!
Todd Brooks
You could come up with a formula (look up compound interest for examples), but it'd be easier just to iterate over the last xx days. The crash factor and increment factor, as well as your maximum health value, would be arbitrarily chosen values that you can tune to get the output you want.
Eclipse
A: 

You might take the reboot rate (reboot count / t) of a particular machine and compare it to the mean and standard deviation of the entire population. Machines whose rate falls, say, three standard deviations above the mean (that is, they reboot more often) could be flagged.
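
A rough Python sketch of that comparison, assuming you already have a reboot rate (for example reboots per day) for each device. The function name `flag_outliers`, the device ids and the sample rates are invented for illustration, and the population standard deviation is one possible choice.

```python
from statistics import mean, pstdev


def flag_outliers(reboot_rates, threshold=3.0):
    """Return ids of devices whose reboot rate is more than `threshold`
    standard deviations above the population mean."""
    if not reboot_rates:
        return []
    rates = list(reboot_rates.values())
    mu = mean(rates)
    sigma = pstdev(rates)  # population standard deviation
    if sigma == 0:
        return []  # every device behaves the same, nothing stands out
    return [dev for dev, rate in reboot_rates.items()
            if (rate - mu) / sigma > threshold]


# Hypothetical data: 20 quiet devices and one that reboots twice a day.
sample = {f"dev-{i}": 0.1 for i in range(20)}
sample["dev-20"] = 2.0
print(flag_outliers(sample))  # ['dev-20']
```

Note that with only a handful of devices no single one can ever sit three standard deviations out, so a lower threshold may be needed for small groups.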

fatcat1111
A: 

You could use weighted average uptime and include the current uptime only when it would make the average higher.

The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
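
One way this might look in Python. The geometric `decay` weighting and the name `weighted_uptime` are assumptions made for this sketch; any weighting that favours recent uptimes would fit the idea.

```python
def weighted_uptime(completed, current=None, decay=0.8):
    """Recency-weighted average uptime.

    completed: uptimes of finished runs, oldest first (e.g. in hours).
    current:   length of the run still in progress, if any.
    decay:     assumed weighting; each older run counts `decay` times
               as much as the run after it.
    """
    def wavg(values):
        n = len(values)
        # The newest value gets weight 1, the one before it `decay`, and so on.
        weights = [decay ** (n - 1 - i) for i in range(n)]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)

    if not completed:
        return float(current) if current is not None else 0.0
    base = wavg(completed)
    if current is None:
        return base
    # Include the current uptime only when it raises the average.
    return max(base, wavg(completed + [current]))


# A device that just rebooted is not punished for its short current run:
print(weighted_uptime([24, 24, 24], current=1))
# A long current run pulls the average up:
print(weighted_uptime([24, 24, 24], current=100))
```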

svick
A: 

Does it always report a runtime of 0 on reboot? Or something close to zero (less than the previous runtime, anyway)?

You could score this in two ways: 1. A score where a lower number means the device had fewer troubles. 2. A score where a higher number means the device had longer periods of uptime.

I guess you need to account for the fact that the health can vary, so it can worsen over time. The latest values should therefore have a higher weight than the older ones. This suggests weights that grow exponentially toward the present.

The more reboots it had in the last period, the more broken the system could be. But you should also look at how closely spaced the reboots are. Say, 5 reboots in a day vs. 10 reboots in 2 weeks: those mean very different things. So I guess time should be a metric in this formula, as well as the number of reboots.

I guess you need to calculate the density of reboots in the last period.

You can weight the density simply by dividing: the larger the number you divide by, the lower the result, and so the lower the weight.

Pseudo code:

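function calcHealth(machine) {
    float value = 0;
    float threshold = 800;
    float secondsPerDay = 86400;

    for each (reboot in machine.reboots) {
        // time() and reboot.time are assumed to be in seconds; convert to days
        reboot.daysPast = (time() - reboot.time) / secondsPerDay;

        // avoid dividing by zero for a reboot that happened just now
        if (reboot.daysPast < 1) reboot.daysPast = 1;

        // the more days past, the lower the value, so the lower the weight
        value += (100 / reboot.daysPast);
    }

    // value == 0 means no reboots were recorded at all
    return (value == 0) ? 0 : (threshold / value);
}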

You could refine this function by, for example, filtering on a maxDaysPast and tuning the threshold.

This formula is based on the plot of f(x) = 100/x. For a low x value the result is higher than for a large x value. That is how the formula calculates the weight from daysPast: lower daysPast == lower x == higher weight.

The value += part counts the reboots, and the 100/x part gives each reboot a weight, where the weight depends on how long ago it happened.

At the return, the threshold is divided by the value. This is because the higher the reboot score, the lower the result must be.

You can use a plotting program or calculator to see the curve of the plot, which is also the curve of the weight as a function of daysPast.
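
For anyone who wants to try it, here is a rough runnable Python version of the pseudo code. It differs from the pseudo code in a few deliberate, assumed ways: `max_days_past` implements the filtering idea mentioned above, `min_days` keeps a very recent reboot from dividing by zero, and a device with no reboots in the window gets `float("inf")` (treated as "as healthy as possible") rather than 0. The name `calc_health` and all parameter names are just for this sketch.

```python
from datetime import datetime, timedelta


def calc_health(reboot_times, now, threshold=800.0,
                max_days_past=None, min_days=1.0):
    """Higher result = healthier. Recent reboots weigh more via 100 / daysPast."""
    value = 0.0
    for t in reboot_times:
        days_past = (now - t).total_seconds() / 86400.0
        if max_days_past is not None and days_past > max_days_past:
            continue  # ignore reboots outside the window
        # Clamp so that a reboot from a moment ago does not divide by zero.
        value += 100.0 / max(days_past, min_days)
    if value == 0:
        return float("inf")  # assumption: no reboots counted means best score
    return threshold / value


now = datetime(2010, 1, 31)
print(calc_health([now - timedelta(days=1)], now))   # one reboot yesterday
print(calc_health([now - timedelta(days=10)], now))  # one reboot ten days ago
```

With these defaults a single reboot yesterday scores 8.0 and a single reboot ten days ago scores 80.0, so older reboots fade out as intended.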

Michiel
density of reboots would be useful if it were able to be weighted according to time period.
Todd Brooks
Any way this formula could be bounded? For example, limit it between a MIN and MAX value.
Todd Brooks
A: 

Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.

Another suggestion is to look into the various moving average algorithms. These are supposed to smooth out time-series data as well as highlight trends.
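
As one example, an exponential moving average over daily reboot counts could be sketched like this in Python. The function name `ema`, the smoothing factor `alpha` and the sample data are assumptions for illustration only.

```python
def ema(values, alpha=0.3):
    """Exponential moving average; a higher alpha reacts faster to recent values."""
    smoothed = []
    current = None
    for v in values:
        # Seed with the first value, then blend each new value in.
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical daily reboot counts for two devices over one week.
rebooted_long_ago = [3, 3, 3, 0, 0, 0, 0]
rebooting_now = [0, 0, 0, 0, 3, 3, 3]

# Both devices have 9 reboots in total, but the latest smoothed value is much
# higher for the device that is rebooting now.
print(ema(rebooted_long_ago)[-1], ema(rebooting_now)[-1])
```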

Bryan Batchelder