A situation has come up a few times in the last few weeks where I'd like to measure some event which might happen regularly (like the time taken to redraw a frame in a 2D smooth-scrolling UI) or at a variable frequency (like a message arriving at a webservice endpoint). I've had an idea of measuring 1) 'normal' frequency, 2) current frequency, 3) min, 4) max. And I'd like to measure these over multiple buckets of time.
For example, a webservice could get 10 messages in 100 ms, then not get any messages for 5 minutes. In the UI example, it could run at 60 FPS for 10 seconds straight, then a GC hits and a single frame is 'frozen' for a full second, which completely ruins the smooth-scrolling effect.
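To make that concrete, here's a minimal sketch of the aggregates I'd keep per bucket of time (C#; the BucketStats name and its members are mine, just for illustration, not from any existing library):

```csharp
using System;

// Hypothetical per-bucket aggregate; name and members are illustrative.
public struct BucketStats
{
    public long Count;
    public double Min;
    public double Max;
    public double Sum;   // kept so Avg can be derived on demand

    public double Avg => Count == 0 ? 0 : Sum / Count;

    public void Add(double sample)
    {
        if (Count == 0) { Min = Max = sample; }
        else
        {
            if (sample < Min) Min = sample;
            if (sample > Max) Max = sample;
        }
        Sum += sample;
        Count++;
    }

    // Merging two buckets is what would make cheap 'promotion' to a
    // coarser time level possible (see below).
    public void Merge(BucketStats other)
    {
        if (other.Count == 0) return;
        if (Count == 0) { this = other; return; }
        Min = Math.Min(Min, other.Min);
        Max = Math.Max(Max, other.Max);
        Sum += other.Sum;
        Count += other.Count;
    }
}
```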
I think these kinds of measurements could be done using a set of 'buckets' for collecting measurements. But unlike a typical time series, the FPS measurement I care about most is the one that DOESN'T arrive at the normal interval (normal in the UI example is a single frame drawn every 1/60th of a second, but the one I care about is 60x longer). So to be useful in both the normal case and the exceptional case, one could use a hierarchy of 'sample buckets':
1..10 'micro' buckets, each measuring 1/10th of a second (several 'micro' buckets are needed to keep an accurate sliding window at the 'normal' level)
1..60 'normal' buckets, each 1 sec
1..60 'macro' buckets, each 1 min
... levels could continue: hours, days, months, years
A set of metrics (avg, min, max, count) could be kept per bucket at each level. When the time period for a bucket expires, the bucket could be 'promoted' to the next level and combined into a 'sample queue' at that level. This would give an accurate sliding-window measurement of each aggregate per bucket at each 'level' in the hierarchy, while using relatively little CPU or memory.
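Here's a rough sketch of how I imagine the hierarchy and the 'promotion' step working, building on the BucketStats sketch above. This is just my own untested idea of it; SampleLevel and everything in it are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical level in the hierarchy: a fixed-length queue of buckets.
// Raw samples enter the lowest level; when a bucket's time period expires
// it is merged ('promoted') into the current bucket of the next level up.
public class SampleLevel
{
    private readonly Queue<BucketStats> _buckets = new Queue<BucketStats>();
    private readonly int _capacity;          // e.g. 10 micro, 60 normal, 60 macro
    public TimeSpan BucketSpan { get; }
    public SampleLevel NextLevel { get; }    // null at the top of the hierarchy

    private BucketStats _current;
    private DateTime _currentStart;

    public SampleLevel(int capacity, TimeSpan bucketSpan, SampleLevel nextLevel)
    {
        _capacity = capacity;
        BucketSpan = bucketSpan;
        NextLevel = nextLevel;
        _currentStart = DateTime.UtcNow;
    }

    public void Add(double sample, DateTime now)
    {
        Roll(now);
        _current.Add(sample);
    }

    // Close out expired buckets and promote them upward. A long idle gap
    // simply rolls empty buckets forward.
    private void Roll(DateTime now)
    {
        while (now - _currentStart >= BucketSpan)
        {
            _buckets.Enqueue(_current);
            if (_buckets.Count > _capacity) _buckets.Dequeue();
            NextLevel?.Promote(_current, _currentStart + BucketSpan);
            _current = default(BucketStats);
            _currentStart += BucketSpan;
        }
    }

    private void Promote(BucketStats expired, DateTime now)
    {
        Roll(now);
        _current.Merge(expired);
    }

    // Sliding-window aggregate over the whole level.
    public BucketStats Window()
    {
        var total = default(BucketStats);
        foreach (var b in _buckets) total.Merge(b);
        total.Merge(_current);
        return total;
    }
}
```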
In a development environment, I think samples at the 'micro' level could be used to identify real-time problems while debugging. In production, the 'normal' level could be displayed to the end user, while the 'macro' level could be stored for long-term trending and analysis (to establish a long-term baseline). Once patterns are identified, it seems like it would be easy to programmatically log or react to significant changes in a metric (like flushing memory caches when the message rate drops off) without overreacting to acceptable anomalies (like the GC pause in the UI).
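For example (again hypothetical, using the SampleLevel sketch above), the reaction logic could compare a short window against a longer baseline so a single outlier doesn't trigger anything:

```csharp
// Hypothetical wiring for the UI example: 10 x 0.1 s micro buckets,
// 60 x 1 s normal buckets, 60 x 1 min macro buckets.
var macro  = new SampleLevel(60, TimeSpan.FromMinutes(1), null);
var normal = new SampleLevel(60, TimeSpan.FromSeconds(1), macro);
var micro  = new SampleLevel(10, TimeSpan.FromSeconds(0.1), normal);

// On every redraw, record the frame time (a made-up value here; in
// practice it would come from a Stopwatch around the draw call).
double frameMillis = 16.7;
micro.Add(frameMillis, DateTime.UtcNow);

// React only to changes that are significant against the longer baseline,
// e.g. the worst frame in the last second is 10x the one-minute average.
BucketStats lastSecond = micro.Window();
BucketStats baseline   = normal.Window();
if (baseline.Count > 0 && lastSecond.Max > baseline.Avg * 10)
    Console.WriteLine("Frame time spike: {0} ms vs. avg {1:F1} ms",
                      lastSecond.Max, baseline.Avg);
```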
I know this is a bit long, but it seems like a simple idea, and I couldn't find any classes or frameworks on the web that do this (at least not in my framework of choice, .NET). Is this a known pattern for measuring and evaluating the health of applications, systems, or measurements that I just couldn't find? Is there any monitoring library or statistical recipe for it, open source or over-the-counter?
P.S. Because of the potentially high sampling rate, I didn't think PerformanceCounters on Windows would be a good fit at the 'micro' level (some metrics, like a real-time display of UI FPS, would be updated many times per second). Also, it would be great if the solution worked on Mono and Silverlight (where PerfCounters aren't available).

P.P.S. I spent a couple of hours looking for statistics libraries in .NET and found a couple, but I couldn't find the simple 'hierarchical time-bounded sampling' I describe above. There's lots of count-bounded sampling, which doesn't apply here, because data streams like redraw rate and message arrival rate don't always occur at regular intervals.