views:

907

answers:

4

Hi everyone,

In our logfiles we store response times for the requests. What's the most efficient way to calculate the median response time, the "75/90/95% of requests were served in less than N time" numbers etc? (I guess a variation of my question is: What's the best way to calculate the median and standard deviation of a bunch stream of numbers).

The best I came up with was just reading all the numbers, ordering them and then picking out the numbers, but that seems really goofy. Isn't there a smarter way?

We use Perl, but solutions for any language might be helpful.

+5  A: 

you can have look at quick select:

http://en.wikipedia.org/wiki/Selection%5Falgorithm

Or at the Wirth algorithm: http://www.mail-archive.com/[email protected]/msg20059.html

Benchmark for the median can be found here: http://ndevilla.free.fr/median/median/index.html

LeMiz
+8  A: 

See the article Calculating Percentiles in Memory-bound Applications. It explains how to calculate median and other percentiles efficiently.

Also, here's an article on calculating standard deviation (variance) as you go: Accurately computing running variance.

John D. Cook
+3  A: 

Have a look at PDL... the Perl Data Language.

Also see these previous SO questions about mean/std dev:

/I3az/

draegtun
Thanks draegtun!
Ask Bjørn Hansen
+1  A: 

There are code examples here: http://rosettacode.org/wiki/Standard%5FDeviation

glenn jackman