views:

326

answers:

6

Suppose I have an iterator of numbers, for example a file object:

f = open("datafile.dat")

now I want to do:

mean = get_mean(f)
sigma = get_sigma(f, mean)

What is the best implementation? Suppose that the file is big

+2  A: 

Make a list from the iterable, or use itertools.tee().

Ignacio Vazquez-Abrams
but wouldn't whole file has to be kept in memory? because get_sigma needs input from get_mean, in that case why not just load whole file in memory
Anurag Uniyal
now I finally know how to I can make codeblock with a link
Otto Allmendinger
+5  A: 

If you want to iterate once, you can write your sum function:

def mysum(l):
    s2 = 0
    s = 0
    for e in l:
        s += e
        s2 += e * e
    return (s, s2)

and use the result in your sigma function.

Edit: now you can calculate the variance like this: (s2 - (s*s) / N) / N

By taking account of @Adam Bowen's comment,
keep in mind that if we use mathematical tricks and transform the original formulas
we may degrade the results.

Nick D
With this solution the mean is `s/n` and the variance is `s2/n - mean*mean` that is to say, the mean of the squares minus the square of the mean. However, you must be aware that calculating the variance this way may be inaccurate for large n because of the difference in scale between s2 and e*e during the accumulation. Unfortunately, this means that for large n the two-pass algorithm is much more accurate (and a better choice).
Adam Bowen
@Adam Bowen, thanks. I forgot to mention that.
Nick D
+1  A: 

I am not sure there is much choice.

You will have to iterate your numbers twice in any case as the standard deviation will require the mean information on each value.

If you have enough memory, you can gain on the I/O access by loading your file in memory during the first iteration but that is about it IMO.

Benoit Vidis
This is false, as per Wikipedia articles cited below...
Mapio
A: 

You have two solutions

  1. Make a list out of your iterator and loop it as many time as you wish. Drawback is everything will be in memory, so not suitable if your file is big. Simple use of itertools.tee also will not save you

  2. There is no other solution , unless , you do not need to pass output of get_mean to get_sigma, because in that case they can only be in series, but if you remove this restriction then you can run both functions in parallel using threads, and use itertools.tee to have two iterators from one

Anurag Uniyal
+3  A: 

I think Nick D has the correct answer.

Assming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and of their squares and them use such sums (toghether with the number of read elements) to compute at the same time mean and variance.

There are some numerical stability issues, but the idea in

http://en.wikipedia.org/wiki/Computational_formula_for_the_variance

is the basic ingredient you need. Some more details are at

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

where I suggest you to read the "Naïve algorithm".

Hope this helps,

Massimo

Mapio
A: 

As I feel that there are good elements scattered in multiple answers, I would like to summarize:

  • If your file is too big to conveniently fit in memory, and if you want a good precision in the variance, you do need to read the file twice (with one pass, the variance is the difference between two large numbers, which is not precise because of floating point limitations). Note that your operating system is likely to provide some automatic speed-up for the second file reading, as it may still be in RAM during the second pass.

  • If you do not care for the precision of the variance, you can simply iterate once over the file and calculate the quantities suggested by Nick D, with the details provided in the comment by Adam Bowen.

EOL