ansaurus

Question

How to efficiently calculate a running standard deviation?

Answer 1

+2 A:

You could look at the Wikipedia article on Standard Deviation, in particular the section about Rapid calculation methods.

There's also an article I found that uses Python; you should be able to use its code without much change: Subliminal Messages - Running Standard Deviations.

Lasse V. Karlsen 2009-07-23 23:14:18

Answer 2

+4 A:

Statistics::Descriptive is a very decent Perl module for these types of calculations:

#!/usr/bin/perl

use strict; use warnings;

use Statistics::Descriptive qw( :all );

my $data = [
    [ 0.01, 0.01, 0.02, 0.04, 0.03 ],
    [ 0.00, 0.02, 0.02, 0.03, 0.02 ],
    [ 0.01, 0.02, 0.02, 0.03, 0.02 ],
    [ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];

my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures

for my $ref ( @$data ) {
    $stat->add_data( @$ref );
    printf "Running mean: %f\n", $stat->mean;
    printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__

Output:

C:\Temp> g
Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566

Sinan Ünür 2009-07-23 23:21:19

Answer 3

+1 A:

I think this will help you: Standard deviation

2009-07-23 23:31:14

+1 @Lasse V. Karlsen's link to Wikipedia's good, but this is the right algorithm I've used...

kenny 2009-07-24 17:44:35

Answer 4

+1 A:

How big is your array? Unless it is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.

My preference would be to use the numpy array maths extension to convert your array of arrays into a numpy 2D array and get the standard deviation directly:

>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0) 
array([ 1. ,  1. ,  0.5,  1.5,  1.5,  1.5])
>>> a.mean(axis=0)
array([ 2. ,  3. ,  4.5,  4.5,  5.5,  6.5])

If that's not an option and you need a pure Python solution, keep reading...

If your array is

x = [ 
      [ 1, 2, 4, 3, 4, 5 ],
      [ 3, 4, 5, 6, 7, 8 ],
      ....
]

Then the standard deviation is:

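from math import sqrt

d = len(x[0])        # number of columns
n = float(len(x))    # number of rows (float avoids integer division on Python 2)
sum_x = [ sum(v[i] for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
# population standard deviation per column: sqrt(mean of squares - square of mean)
std_dev = [ sqrt(sx2/n - (sx/n)**2) for sx, sx2 in zip(sum_x, sum_x2) ]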

If you are determined to loop through your array only once, the running sums can be combined.

sum_x  = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
    for i, t in enumerate(v):
        sum_x[i] += t
        sum_x2[i] += t**2

This isn't nearly as elegant as the list comprehension solution above.
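A minimal sketch of how those running sums could yield a standard deviation after every row, assuming the population (divide-by-n) definition. The function name running_column_std is made up for illustration:

```python
from math import sqrt

def running_column_std(rows):
    """Yield the per-column population std dev after each row is consumed."""
    sum_x = sum_x2 = None
    n = 0
    for row in rows:
        if sum_x is None:
            # Size the accumulators from the first row.
            sum_x = [0.0] * len(row)
            sum_x2 = [0.0] * len(row)
        n += 1
        for i, t in enumerate(row):
            sum_x[i] += t
            sum_x2[i] += t * t
        # max() guards against tiny negative values caused by rounding.
        yield [sqrt(max(0.0, sx2 / n - (sx / n) ** 2))
               for sx, sx2 in zip(sum_x, sum_x2)]

x = [[1, 2, 4, 3, 4, 5], [3, 4, 5, 6, 7, 8]]
for std in running_column_std(x):
    print(std)
```

The last line printed should agree with the a.std(axis=0) result shown earlier, since repeating the two rows does not change the per-column spread.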

2009-07-23 23:35:54

I do actually have to deal with zillions of numbers, which is what motivates my need for an efficient solution. Thanks!

Alex Reynolds 2009-07-24 04:33:09

Answer 5

+1 A:

Here's one way to do it in Java:

package statistics;

public class Statistics
{
    private double sum;
    private double sumOfSquares;
    private int numPoints;

    public static void main(String[] args)
    {
        double [] values = new double[args.length];

        for (int i = 0; i < args.length; ++i)
        {
            values[i] = Double.parseDouble(args[i]);
        }

        Statistics statistics = new Statistics(values);
        System.out.println(statistics);
    }

    public Statistics(double [] values)
    {
        for (int i = 0; i < values.length; ++i)
        {
            sum += values[i];
            sumOfSquares += values[i]*values[i];
            ++numPoints;
        }
    }

    public synchronized double getAverage()
    {
        double average = 0.0;

        if (numPoints > 0)
        {
            average = sum/numPoints;            
        }

        return average;
    }

    public synchronized double getStandardDeviation()
    {
        double standardDeviation = 0.0;

        if (numPoints > 1)
        {
            double average = getAverage();
            // Clamp at zero: rounding can make the difference slightly negative.
            standardDeviation = Math.sqrt(Math.max(0.0, sumOfSquares/numPoints - average*average));
        }

        return standardDeviation;
    }

    public  synchronized void addValue(double newValue)
    {
        sum += newValue;
        sumOfSquares += newValue*newValue;
        ++numPoints;
    }

    public String toString()
    {
        return new StringBuilder().append("Statistics{").append("sum=").append(sum).append(", sumOfSquares=").append(sumOfSquares).append(", numPoints=").append(numPoints).append(", average=").append(getAverage()).append(", std dev=").append(getStandardDeviation()).append('}').toString();
    }
}

I'm not as happy with this version, because it doesn't check for overflow in the event of a value larger than sqrt(Double.MAX_VALUE), whose square cannot be represented, but it demonstrates how to calculate mean and standard deviation using running totals instead of arrays.

duffymo 2009-07-23 23:39:13

Answer 6

+18 A:

The basic answer is to accumulate the sum of both x (call it 'sum_x1') and x² (call it 'sum_x2') as you go. The value of the standard deviation is then:

stdev = sqrt((sum_x2 / n) - (mean * mean))

where

mean = sum_x1 / n

This is the population standard deviation; you get the sample standard deviation by using 'n - 1' instead of 'n' as the divisor: sqrt((sum_x2 - n * mean * mean) / (n - 1)).

You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
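To make the stability concern concrete, here is a small sketch (assuming Python 3 and made-up function names) that compares the running-sums formula with a two-pass calculation on the same four values, first as they are and then shifted by a large offset:

```python
from math import sqrt

def naive_pstdev(values):
    """Population std dev from sums of x and x**2 (one pass)."""
    n = len(values)
    sum_x1 = sum(values)
    sum_x2 = sum(v * v for v in values)
    mean = sum_x1 / n
    variance = sum_x2 / n - mean * mean
    # Rounding can push the difference below zero; clamp to avoid a domain error.
    return sqrt(max(0.0, variance))

def two_pass_pstdev(values):
    """Population std dev from deviations about the mean (two passes)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / n)

small = [4.0, 7.0, 13.0, 16.0]
shifted = [1e9 + v for v in small]   # same spread, large offset

print(naive_pstdev(small), two_pass_pstdev(small))
print(naive_pstdev(shifted), two_pass_pstdev(shifted))
```

Both methods agree on the small values. On the shifted values the squares are around 1e18, where double precision can no longer resolve a variance of 22.5, so the one-pass result drifts away from the two-pass one.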

Jonathan Leffler 2009-07-23 23:39:33

This is what I was going to suggest. It's the best and fastest way, assuming precision errors are not a problem.

Ray Hidayat 2009-07-24 00:08:29

I decided to go with Welford's Algorithm as it performs more reliably with the same computational overhead.

Alex Reynolds 2009-07-29 23:34:28
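For reference, a minimal sketch of Welford's algorithm as it is commonly described. The class and method names are illustrative assumptions, not code from the original poster:

```python
from math import sqrt

class RunningStats:
    """Welford's online update for mean and variance (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from both the old and the updated mean.
        self.m2 += delta * (x - self.mean)

    def variance(self, sample=True):
        divisor = self.n - 1 if sample else self.n
        return self.m2 / divisor if divisor > 0 else 0.0

    def stdev(self, sample=True):
        return sqrt(self.variance(sample))

stats = RunningStats()
for value in [0.01, 0.01, 0.02, 0.04, 0.03]:
    stats.push(value)
    print(stats.n, stats.mean, stats.stdev())
```

After the fifth value the sample standard deviation should be about 0.013038, matching the first "Running stdev" line in the Statistics::Descriptive output above. Because it never forms the sum of squares, this update avoids subtracting two large, nearly equal numbers.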

This is a simplified version of the answer and may give non-real results depending on the input (i.e., when sum_x2 < sum_x1 * sum_x1). To ensure a valid real result, go with `sd = sqrt(((n * sum_x2) - (sum_x1 * sum_x1)) / (n * (n - 1)))`.

Dan Tao 2009-10-08 15:17:42

@Dan: am I missing something? Your expression appears to be different from mine - as in, guaranteed to produce a different result - because you've multiplied sum_x2 by n but not made a compensating multiplication of sum_x1 * sum_x1?

Jonathan Leffler 2009-10-08 16:42:49
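One way to see how the two expressions in this exchange relate is to evaluate both on a small sample. The sketch below assumes Python 3 and uses the standard library's statistics module as the reference; the variable names follow the thread:

```python
from math import sqrt
from statistics import pstdev, stdev

data = [0.01, 0.01, 0.02, 0.04, 0.03]
n = len(data)
sum_x1 = sum(data)
sum_x2 = sum(v * v for v in data)
mean = sum_x1 / n

# Divisor n: population standard deviation (the formula in the answer).
population_sd = sqrt(sum_x2 / n - mean * mean)

# Divisor n - 1: sample standard deviation (the formula in Dan Tao's comment).
sample_sd = sqrt((n * sum_x2 - sum_x1 * sum_x1) / (n * (n - 1)))

print(population_sd, pstdev(data))
print(sample_sd, stdev(data))
```

The two results differ by a factor of sqrt(n / (n - 1)), so neither expression is a rearrangement of the other; they answer slightly different questions.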

@Dan points out a valid issue - the formula above breaks down for x>1 because you end up taking the sqrt of a negative number. The Knuth approach is: sqrt((sum_x2 / n) - (mean * mean)) where mean = (sum_x / n).

Greg Harman 2010-07-27 04:12:30

@flies: The answer has changed since I left that comment 1 year ago and Greg left his over two months ago. The formula used to be sqrt((sum_x2 - sum_x1 * sum_x1) / (n - 1)), which, unless I'm mistaken, was actually incorrect.

Dan Tao 2010-10-08 13:16:04

@Dan thanks for the response. deleted my previous comment.

flies 2010-10-08 13:20:47

Answer 7

+4 A:

Perhaps not what you were asking, but ... If you use a numpy array, it will do the work for you, efficiently:

from numpy import array

nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
              (0.00, 0.02, 0.02, 0.03, 0.02),
              (0.01, 0.02, 0.02, 0.03, 0.02),
              (0.01, 0.00, 0.01, 0.05, 0.03)))

print(nums.std(axis=1))
# [ 0.0116619   0.00979796  0.00632456  0.01788854]

print(nums.mean(axis=1))
# [ 0.022  0.018  0.02   0.02 ]

By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:

http://lingpipe-blog.com/2009/03/19/computing-sample-mean-variance-online-one-pass/

ars 2009-07-24 02:32:58

Answer 8

+4 A:

Have a look at PDL (pronounced "piddle!").

This is the Perl Data Language which is designed for high precision mathematics and scientific computing.

Here is an example using your figures....

use strict;
use warnings;
use feature 'say';   # enables say()
use PDL;

my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );

say "Mean scores:     ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms):  ", $rms;

Which produces:

Mean scores:     [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms):  [0.011661904 0.009797959 0.0063245553 0.017888544]

Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".

However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)
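A quick way to tell the three candidates apart is to recompute them by hand for the first row. The sketch below is plain Python, not PDL, and the reading of each name (mean absolute deviation, divisor n - 1, divisor n) is inferred from the printed numbers rather than taken from the PDL documentation:

```python
from math import sqrt

row = [0.01, 0.01, 0.02, 0.04, 0.03]
n = len(row)
mean = sum(row) / n
sq_dev = sum((v - mean) ** 2 for v in row)

mean_abs_dev = sum(abs(v - mean) for v in row) / n   # compare with ADEV
sample_sd = sqrt(sq_dev / (n - 1))                   # compare with PRMS
population_sd = sqrt(sq_dev / n)                     # compare with RMS

print(mean_abs_dev, sample_sd, population_sd)
```

For this row the three values come out near 0.0104, 0.013038 and 0.011662, which line up with the ADEV, PRMS and RMS outputs respectively. On that reading PRMS and RMS are both standard deviations (sample and population), while ADEV is a different measure of spread.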

For more PDL information have a look at:

/I3az/

draegtun 2009-07-24 17:34:54

Answer 9

+6 A:

Bob Carpenter 2009-08-28 18:24:27
