A situation has come up a few times in the last few weeks where I'd like to measure some event which might happen regularly (like the time taken to redraw a frame in a 2D smooth-scrolling UI) or at a variable frequency (like a message arriving at a webservice endpoint). I've had an idea of measuring 1) 'normal' frequency, 2) current frequency, 3) min, 4) max, and I'd like to measure these over multiple buckets of time.

For example, a webservice could get 10 messages in 100 ms, then not get any messages for 5 minutes. In the UI example it could be running at 60 FPS for 10 seconds straight, then a GC hits and a single frame could be 'frozen' for 1 second, which completely ruins the UI effect.

I think these kinds of measurements could be done using a set of 'buckets' for collecting them. But unlike a regular time series, the FPS measurement I care about most is the one that DOESN'T arrive at the normal interval (normal in the UI example is a single frame drawn every 1/60th of a second, but the one I care about is 60x longer). So to be useful both in the normal case and the exceptional case, one could use a hierarchy of 'sample buckets':

1..10 'micro' buckets, each measures 1/10th of a second
many 'micro' buckets are needed to keep an accurate sliding window at the 'normal' level

1..60 'normal' buckets, each 1 sec
1..60 'macro' buckets, each 1 min
... levels could continue: hours, days, months, years

A set of metrics (avg, min, max, count) could be kept per bucket at each level. When the time period for a bucket expires the bucket could be 'promoted' to the next level and combined into a 'sample queue' at that level. This would give an accurate sliding-window measurement of each aggregate per bucket at each 'level' in the hierarchy, while using relatively little CPU or memory.
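To make that concrete, here is a rough sketch of the structure I have in mind, in C#. As far as I know nothing like this exists in a library; SampleBucket and BucketLevel are names I just made up, and the promotion logic is simplified (it assumes each level's buckets divide evenly into the next level's, and ignores thread safety):

    using System;
    using System.Collections.Generic;

    // Aggregates kept per bucket; avg is derived from Sum / Count.
    class SampleBucket
    {
        public long Count;
        public double Sum;
        public double Min = double.PositiveInfinity;
        public double Max = double.NegativeInfinity;

        public double Avg { get { return Count == 0 ? 0.0 : Sum / Count; } }

        public void Add(double value)
        {
            Count++;
            Sum += value;
            if (value < Min) Min = value;
            if (value > Max) Max = value;
        }

        // Fold a finer-grained bucket into this one during promotion.
        public void Merge(SampleBucket finer)
        {
            Count += finer.Count;
            Sum += finer.Sum;
            if (finer.Min < Min) Min = finer.Min;
            if (finer.Max > Max) Max = finer.Max;
        }
    }

    // One level of the hierarchy: a sliding window of buckets, plus an
    // accumulator that promotes expired buckets to the next (coarser) level.
    class BucketLevel
    {
        private readonly int _windowSize;       // buckets kept at this level
        private readonly int _perCoarser;       // buckets here per coarser bucket
        private readonly BucketLevel _coarser;  // null at the top of the hierarchy
        private readonly Queue<SampleBucket> _window = new Queue<SampleBucket>();
        private SampleBucket _pending = new SampleBucket();
        private int _pendingCount;

        public BucketLevel(int windowSize, int perCoarser, BucketLevel coarser)
        {
            _windowSize = windowSize;
            _perCoarser = perCoarser;
            _coarser = coarser;
        }

        // Called when a bucket of this level's duration has expired.
        public void Close(SampleBucket bucket)
        {
            _window.Enqueue(bucket);
            if (_window.Count > _windowSize) _window.Dequeue();

            if (_coarser == null) return;
            _pending.Merge(bucket);
            if (++_pendingCount == _perCoarser)
            {
                _coarser.Close(_pending);   // 'promote' to the next level
                _pending = new SampleBucket();
                _pendingCount = 0;
            }
        }

        public IEnumerable<SampleBucket> Window { get { return _window; } }
    }

    // Wiring up the levels from the example above (coarsest first):
    // var macro  = new BucketLevel(60,  0, null);    // 60 x 1 min
    // var normal = new BucketLevel(60, 60, macro);   // 60 x 1 sec
    // var micro  = new BucketLevel(10, 10, normal);  // 10 x 0.1 sec
    // The measuring code Add()s samples into a SampleBucket for 100 ms at a
    // time, then calls micro.Close(bucket); promotion upward is automatic.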

In a development environment I think samples at the 'micro' level could be used to identify real-time problems while debugging. In production the 'normal' level could be displayed to the end user while the 'macro' level could be stored for long-term trending and analysis (to establish a long-term baseline). Once patterns are identified it seems like it would be easy to programmatically log or react to significant changes in a metric (like the message rate dropping off, to flush memory caches) without overreacting to acceptable anomalies (like the GC pause in the UI).
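For the 'react without overreacting' part, I'm imagining something as simple as comparing recent 'normal'-level buckets against the long-term 'macro' baseline and only acting on a sustained change, roughly like this (RateWatch is just an illustrative name, and the 0.5 and 3 thresholds are made up):

    using System.Collections.Generic;

    static class RateWatch
    {
        // Flag a drop-off only when several consecutive 'normal' buckets fall
        // well below the long-term 'macro' baseline, so that one slow bucket
        // (a GC pause) isn't enough to trip the alarm.
        public static bool HasDroppedOff(IEnumerable<double> recentNormalRates,
                                         double macroBaselineRate)
        {
            int consecutiveBelow = 0;
            foreach (var rate in recentNormalRates)
            {
                consecutiveBelow = rate < 0.5 * macroBaselineRate ? consecutiveBelow + 1 : 0;
                if (consecutiveBelow >= 3) return true;
            }
            return false;
        }
    }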

I know this is a bit long, but it seems like a simple idea and I couldn't find any classes or frameworks that do this on the web (at least not in my framework of choice, .NET). Is this a known pattern for measuring and evaluating the health of applications and systems that I just couldn't find? Any monitoring library or statistical recipe available via open source or over-the-counter?

P.S. Because of the possible high rate of sampling I didn't think PerformanceCounters on Windows would be a good fit at the 'micro' level (updating metrics many times per second in some cases, like real-time display of UI FPS). Also, it would be great if the solution worked on Mono and Silverlight (where PerfCounters aren't available). P.P.S. I spent a couple of hours looking for statistics libraries in .NET, and found a couple, but couldn't find a simple 'hierarchical time-bounded sampling' like I describe above. Lots of count-bounded sampling, which doesn't apply here, because data streams like redraw rate and message arrival rate don't always occur at regular intervals.

+1  A: 

It looks like this is common in financial services, using 'time compression' to speed up data analysis when the original data set (or even indexes on the original data) doesn't fit into memory.

This link gives an example in SQL. I'd like to use the same memory/speed trade-off to track critical metrics in-process.

http://www.codeproject.com/KB/solution-center/Izenda-Speed-Dating.aspx

I'm wondering if I'm missing something simple, as it seems like this would be really useful but I don't see anyone else doing it.

crtracy
A: 

This is a good approach, and certainly one I've heard of and implemented, though I'm not familiar with any libraries which implement it in a generic way.

One alternative to doing this 'live' is to log things at a very fine-grained level in e.g. a database, and then progressively 'collapse' the data as it becomes out-of-date/irrelevant. For example, imagine a SQL table which contains {DATE, GRANULARITY, COUNT} tuples; you initially insert your counts with 'Second' granularity; periodically you come along and coalesce a set of rows like

DATE                GRANULARITY       COUNT
20100917 10:05:01   Second            4
20100917 10:05:08   Second            2
20100917 10:05:40   Second            1

into a single row:

20100917 10:05:00   Minute            7

based on their age, and then collapse the minutes into hours, etc etc.
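The collapse itself is basically a GROUP BY on the truncated date; if you wanted to do the same thing in-process rather than in SQL, it's roughly this shape in LINQ (the Tally type here is made up purely for illustration):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Tally
    {
        public DateTime Date;
        public string Granularity;
        public long Count;
    }

    static class Collapse
    {
        // Roll 'Second' rows older than some cutoff up into 'Minute' rows.
        public static IEnumerable<Tally> SecondsToMinutes(IEnumerable<Tally> rows,
                                                          DateTime cutoff)
        {
            return rows
                .Where(r => r.Granularity == "Second" && r.Date < cutoff)
                .GroupBy(r => new DateTime(r.Date.Year, r.Date.Month, r.Date.Day,
                                           r.Date.Hour, r.Date.Minute, 0))
                .Select(g => new Tally
                {
                    Date = g.Key,
                    Granularity = "Minute",
                    Count = g.Sum(r => r.Count)
                });
        }
    }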

We do something similar at my current employer; we log sampled data at a high frequency with the open-source Performance Co-Pilot tool, and then, as the data becomes older and less valuable, coalesce it into more compact, coarser-grained logs using the pmlogextract tool.

Cowan
Yep, that matches what I describe. If no one has offered an open library to do this then maybe I'll be the first. Thanks for at least admitting you've taken this approach in the past, worth an answer nod.
crtracy