I'm doing some work to analyse the access logs from a Catalyst web application. The data comes from the load balancers in front of the web farm and totals about 35 GB per day. It's stored in a Hadoop HDFS filesystem and I use MapReduce (via Dumbo, which is great) to crunch the numbers.
The purpose of the analysis is to try to establish a usage profile -- which actions are used most, what the average response time for each action is, whether the response was served from a backend or a cache -- for capacity planning, optimisation, and setting thresholds for monitoring systems. Traditional tools like Analog will give me the most-requested URL or the most-used browser, but none of that is useful to me. I don't need to know that /controller/foo?id=1984 is the most popular URL; I need to know the hit rate and response time for all hits to /controller/foo, so I can see whether there's room for optimisation or caching and try to estimate what might happen if hits for this action suddenly double.
I can easily break the data down into requests per action per period via MapReduce. The problem is displaying it in a digestible form and picking out important trends or anomalies. My output is of the form:
('2009-12-08T08:30', '/ctrl_a/action_a') (2440, 895)
('2009-12-08T08:30', '/ctrl_a/action_b') (2369, 1549)
('2009-12-08T08:30', '/ctrl_b/action_a') (2167, 0)
('2009-12-08T08:30', '/ctrl_b/action_b') (1713, 1184)
('2009-12-08T08:31', '/ctrl_a/action_a') (2317, 790)
('2009-12-08T08:31', '/ctrl_a/action_b') (2254, 1497)
('2009-12-08T08:31', '/ctrl_b/action_a') (2112, 0)
('2009-12-08T08:31', '/ctrl_b/action_b') (1644, 1089)
i.e., the keys are (time period, action) pairs and the values are (hits, cache hits) tuples. (I don't have to stick with this; it's just what I have so far.)
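For reference, the Dumbo job producing that output is along these lines. The log format shown here is made up (a truncated timestamp, the request path, and a cache/backend flag); the real lines from the load balancer have more fields, but the shape of the job is the same:

    import re
    import dumbo

    # Hypothetical log line: "2009-12-08T08:30:12 /ctrl_a/action_a?id=42 backend"
    LINE_RE = re.compile(r'^(\S+T\d\d:\d\d):\d\d\s+(\S+)\s+(\S+)')

    def mapper(key, line):
        m = LINE_RE.match(line)
        if not m:
            return
        minute, url, source = m.groups()
        action = url.split('?')[0]           # /controller/foo?id=1984 -> /controller/foo
        cache_hit = 1 if source == 'cache' else 0
        yield (minute, action), (1, cache_hit)

    def reducer(key, values):
        hits = cache_hits = 0
        for h, c in values:
            hits += h
            cache_hits += c
        # -> ('2009-12-08T08:30', '/ctrl_a/action_a') (2440, 895)
        yield key, (hits, cache_hits)

    if __name__ == '__main__':
        dumbo.run(mapper, reducer)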
There are about 250 actions. They could be combined into a smaller number of groups, but plotting the number of requests (or response time, etc.) for each action over time on the same graph probably won't work. Firstly, it would be far too noisy; secondly, the absolute numbers don't matter much -- a 100 req/min rise in requests for an often-used, lightweight, cacheable response is much less important than a 100 req/min rise for a seldom-used but expensive (maybe it hits the DB), uncacheable response. On the same graph we wouldn't see the changes in requests for the little-used action at all.
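As a toy example of why the absolute numbers mislead (all figures made up), normalising each action against its own baseline makes the quiet-but-expensive action stand out instead of disappearing:

    # Made-up baseline and current rates in req/min.
    baseline = {'/ctrl_a/action_a': 5000, '/ctrl_b/action_b': 50}
    current  = {'/ctrl_a/action_a': 5100, '/ctrl_b/action_b': 150}

    for action, rate in sorted(current.items()):
        pct_change = 100.0 * (rate - baseline[action]) / baseline[action]
        print('%s %+.0f%%' % (action, pct_change))
    # /ctrl_a/action_a +2%    -- invisible on an absolute-scale graph, and rightly so
    # /ctrl_b/action_b +200%  -- the one I actually care about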
A static report isn't much good -- a huge table of numbers is hard to digest. If I aggregate by the hour, I might miss important minute-by-minute changes.
Any suggestions? How're you handling this problem? I guess one way would be to somehow highlight significant changes in the rate of requests or response time per action. A rolling average and standard deviation might show this, but could I do something better?
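To make the rolling-average idea concrete, this is roughly what I had in mind (run per action over the per-minute hit counts or response times, reporting only the minutes that trip the threshold) -- but maybe there's a smarter statistical test:

    from collections import deque
    from math import sqrt

    def anomalies(series, window=30, threshold=3.0):
        """Yield (minute, value, z) where value deviates from the trailing
        window's mean by more than `threshold` standard deviations."""
        recent = deque(maxlen=window)
        for minute, value in series:
            if len(recent) == window:
                mean = sum(recent) / float(window)
                var = sum((x - mean) ** 2 for x in recent) / window
                std = sqrt(var) or 1.0        # avoid division by zero on flat series
                z = (value - mean) / std
                if abs(z) > threshold:
                    yield minute, value, z
            recent.append(value)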
What other metrics or figures could I generate?