I'm writing a report that shows total downtime for our website. When a user visits our site and something's not working (i.e. the load balancer thinks our site isn't responsive), the load balancer sends the visitor to a "Maintenance" page. The maintenance page logs to the database that it's been viewed and displays a friendly message to the visitor.

That means that I end up with a table of values that looks like this:

ReportedOutage
-----------------------
2010-07-30 06:23:18.093
2010-07-30 06:23:18.623
2010-07-30 06:23:18.720
2010-08-02 14:28:07.123

Ideally, I'd want to run the report and see something like this:

OutageStart              OutageEnd
-----------------------  -----------------------
2010-07-30 06:23:18.093  2010-07-30 06:23:18.720
2010-08-02 14:28:07.123  2010-08-02 14:28:07.123

Since I have only the failed records in the logs, how do I calculate the length of the various outages? I can start by getting MIN(ReportedOutage), but then I have to find the last record in the series, i.e. the one that is followed by a significant gap before the next record.

Any thoughts on how to do this? I realize that I could create a process to check the site every minute and record outages and successes, which would make this easier, but I'm trying to work with what I've got before I add another step.

+3  A: 

It sounds like you basically need to guess at some maximum time between visits. If you actually only had one visit every 10 days, then everything in that table could represent a single outage... but it's rather likely that it doesn't.

So guess at a reasonable value - e.g. 5 minutes (it would be unusual to not have any hits in 5 minutes, and unusual for two separate outages to occur within 5 minutes of each other). Then find any gap between two values (sorted into chronological order, of course) where the gap is greater than that time interval. Those records will indicate the end of one outage and the start of the next.
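As a rough illustration of that gap rule, here is a minimal Python sketch. It assumes the timestamps have already been pulled out of the table into a list; the function name `group_outages` and the 5-minute `max_gap` default are illustrative choices, not anything from the original setup.

```python
from datetime import datetime, timedelta


def group_outages(timestamps, max_gap=timedelta(minutes=5)):
    """Collapse maintenance-page hits into (start, end) outage windows.

    A hit that arrives within max_gap of the previous hit extends the
    current outage; a larger gap closes it and starts a new one.
    """
    outages = []
    for ts in sorted(timestamps):
        if outages and ts - outages[-1][1] <= max_gap:
            # Still the same outage: move its end forward.
            outages[-1] = (outages[-1][0], ts)
        else:
            # First hit overall, or the gap was too long: new outage.
            outages.append((ts, ts))
    return outages


# The sample rows from the question.
FMT = "%Y-%m-%d %H:%M:%S.%f"
reported = [
    datetime.strptime(s, FMT)
    for s in (
        "2010-07-30 06:23:18.093",
        "2010-07-30 06:23:18.623",
        "2010-07-30 06:23:18.720",
        "2010-08-02 14:28:07.123",
    )
]

for start, end in group_outages(reported):
    print(start, end)
```

On the sample data this yields two windows: the first three hits merge into one outage, and the hit three days later stands alone. It is a single pass over the sorted list, so even a long log only needs to be walked once.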

Exactly how you do that will depend on your environment - I know how I'd do it in C#, but I wouldn't presume to try it in straight SQL, for example :)

Jon Skeet
On further thought, my question is whether I'm going to have to iterate through the entire list (which could get long) and decide if each value constitutes the beginning/middle/end of an outage, and it sounds like I will. Thanks for your help.
rwmnau
+1  A: 

Unless you have some other information about how often the server gets hits, you can't answer the question you are trying to answer.

And even if you have that data, a rigorous analysis of server outages would not be easy:

If you have information about how often the site has historically been hit in a certain time interval (say 6am-7am on a Monday), you could model the probabilities of server failures using a Poisson process and fit it to your data for that interval. That will give you the likelihood of an outage in that time interval, and if you model the length of an outage correctly (or guess it well), you could get an expected duration of all outages in a given day.
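One small piece of this can be sketched without a full model. If hits during an outage are assumed to arrive as a Poisson process with a known rate, the probability of seeing no hit at all for t seconds is exp(-rate * t), so the silence needed before treating an outage as over can be derived from the rate. The function name, the 0.1% tolerance, and the two traffic rates below are hypothetical values for illustration only.

```python
import math


def silence_threshold_seconds(hits_per_second, false_split_prob=0.001):
    """Seconds of silence after which an outage can be treated as over.

    Assumes Poisson arrivals: P(no hit in t seconds) = exp(-rate * t).
    Solving exp(-rate * t) = false_split_prob for t gives the gap that
    an ongoing outage would produce by chance only that often.
    """
    if hits_per_second <= 0:
        raise ValueError("hits_per_second must be positive")
    if not 0 < false_split_prob < 1:
        raise ValueError("false_split_prob must be between 0 and 1")
    return -math.log(false_split_prob) / hits_per_second


# Hypothetical rates: a busy daytime hour and a quiet overnight hour.
daytime = silence_threshold_seconds(hits_per_second=2.0)
overnight = silence_threshold_seconds(hits_per_second=1 / 120)
print(f"daytime gap: {daytime:.1f}s, overnight gap: {overnight:.1f}s")
```

The point is only that the acceptable gap shrinks as traffic grows, so a threshold fitted per time-of-day interval would be tighter than one fixed guess. It says nothing about how often outages themselves occur.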

For most applications, it would be much simpler and more accurate to implement the check process you mentioned in your post.

Assaf
The website is for a major ISP, so the time between visits isn't a concern - most of the day, it's less than a second, though it can be minutes overnight. I'll probably end up walking the list and deciding, for each value I have, whether it marks the start or end of an outage. Thanks!
rwmnau