I work for a Fortune 500 company that struggles to accurately measure performance and availability for high-availability applications (i.e., apps that are up 99.5% of the time with 5-second page-to-page navigation). We factor in both scheduled and unscheduled downtime to determine this availability number. However, we recently added a CDN into the mix, which complicates our metrics. The CDN now handles about 75% of our traffic and sends the remainder to our own servers.
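To make the arithmetic concrete, here is a rough sketch of how I understand the availability number and the downtime budget implied by a 99.5% target. The 30-day month and the example downtime figures are assumptions for illustration only, and the function names are mine, not taken from any monitoring tool.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day reporting period


def availability(total_minutes, scheduled_down, unscheduled_down):
    """Fraction of the period the app was up, counting both kinds of downtime."""
    return (total_minutes - scheduled_down - unscheduled_down) / total_minutes


def downtime_budget(total_minutes, target):
    """Minutes of downtime allowed while still meeting the target."""
    return total_minutes * (1 - target)


# Hypothetical month: 120 minutes of scheduled and 60 minutes of unscheduled downtime.
measured = availability(MINUTES_PER_MONTH, scheduled_down=120, unscheduled_down=60)
budget = downtime_budget(MINUTES_PER_MONTH, target=0.995)

print(f"measured availability: {measured:.4%}")
print(f"downtime budget at 99.5%: {budget:.0f} minutes")
```

Under these assumptions, a 99.5% target leaves roughly 216 minutes of total downtime per 30-day month, which is why it matters how a partial outage is counted.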

We attempt to measure what we call a "true user experience" (i.e., our testing scripts emulate a typical user clicking through the application). These monitoring scripts sit outside our network, which means they hit the CDN about 75% of the time.

Management has decided that we take the worst-case scenario to measure availability. So if our origin servers are having problems but the CDN is serving content just fine, we still take a hit on availability. The same is true the other way around. My thought is that as long as the "user experience" is successful, we should not unnecessarily punish ourselves. After all, a CDN is there to improve performance and availability!
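Here is a minimal sketch of the difference between the two ways of counting. The worst-case function reflects management's rule as I described it. The traffic-weighted function is only one possible alternative that I am assuming for comparison: it weights each side by its share of traffic, using the 75% CDN figure. The `Sample` type and the sample data are hypothetical.

```python
from dataclasses import dataclass

CDN_SHARE = 0.75  # share of traffic served by the CDN (approximate figure from above)


@dataclass
class Sample:
    """One monitoring interval: was each side healthy? (hypothetical structure)"""
    cdn_ok: bool
    origin_ok: bool


def worst_case_availability(samples):
    """An interval counts as up only if BOTH the CDN and the origin were healthy."""
    up = sum(1 for s in samples if s.cdn_ok and s.origin_ok)
    return up / len(samples)


def traffic_weighted_availability(samples, cdn_share=CDN_SHARE):
    """Each interval is credited in proportion to the traffic that was served successfully."""
    credited = sum(
        cdn_share * s.cdn_ok + (1 - cdn_share) * s.origin_ok for s in samples
    )
    return credited / len(samples)


# Made-up data: 97 healthy intervals, 2 origin-only failures, 1 CDN-only failure.
samples = (
    [Sample(True, True)] * 97
    + [Sample(True, False)] * 2
    + [Sample(False, True)] * 1
)

print(f"worst case:       {worst_case_availability(samples):.2%}")
print(f"traffic weighted: {traffic_weighted_availability(samples):.2%}")
```

With this made-up data the worst-case rule reports 97.00% while the weighted view reports 98.75%, so the same month can land on either side of a 99.5%-style threshold depending purely on the counting rule.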

I'm wondering whether anyone knows how other Fortune 500 companies calculate their availability numbers. Take apple.com, for instance, as an example of a storefront that uses a CDN and never seems to be down (unless there is about to be a major product announcement). It would be great to have some hard, factual data, because I don't believe that we need to unnecessarily hurt ourselves on these metrics. We are making business decisions based on these numbers.

I can say, however, that because these metrics are visible to management, issues get addressed and resolved pretty fast (read: we cut through the red tape pretty quickly). As a developer, though, I don't want management to think that the application is up or down just because some external factor (i.e., the CDN) is influencing the numbers.

Thoughts?

(At the request of Sanoj, I reposted this question on ServerFault. Can somebody close this question? http://serverfault.com/questions/119186/looking-for-a-recommendation-on-measuring-a-high-availability-app-that-is-using-a)