I have a two dimensional array of bytes which looks like this:

int n = 100000;
int d = 128;
byte[][] samples = new byte[n][d];
/* proceed to fill samples with some delicious data */
byte[] mean = new byte[d];
findMean(mean,samples);

My findMean function proceeds to fill mean such that:

mean[k] = mean(samples[:][k])

Simple enough so far. The issue is that, because of overflow, this mean function cannot simply sum the values and divide. So my current attempt is to calculate a running mean, the workhorse of which looks something like this:

for (int i = 0; i < samples.length; i++) {
    byte diff = (byte) (samples[i][k] - mean[k]);
    mean[k] = (byte) (mean[k] + Math.round((double) diff / (double) (i + 1)));
}

Now this doesn't work at all; every round, the precision loss leaves the mean quite far from the correct value, which I have verified on small (therefore calculable) sets of 1000 random samples.

Also, due to the very memory constraints I'm trying to avoid by using byte arrays in the first place, it is quite impossible to allocate a large proxy float array to calculate the true mean and cast it back to bytes later.

Loading this data in chunks is... well, it's possible, but I'm considering that my final alternative, and anyway, surely that just displaces the problem to the chunk size?

Anyway: accurate calculation of the mean for an array of bytes, using a running algorithm to avoid overflow issues. Is there a good solution here?

Cheers

+2  A: 

You could use a larger-size integer type (long / BigInteger), or even arbitrary-precision arithmetic, to calculate the sum. In this case, you don't really need the online algorithm, although keeping it wouldn't have any impact other than making the calculation slower.

When you do divide the sum by the count to calculate the mean, you will of course be restricted by the precision of the floating-point type you are using, so keep that in mind. If you go down the APA route, this will not be an issue.
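A minimal sketch of that approach, assuming the findMean signature from the question and one long accumulator per dimension (with n = 100000 byte samples, each per-dimension sum stays far below what a long can hold):

// Sketch: sum each dimension into a long, then divide once at the end.
static void findMean(byte[] mean, byte[][] samples) {
    int d = mean.length;
    long[] sums = new long[d];               // one accumulator per dimension
    for (byte[] sample : samples) {
        for (int k = 0; k < d; k++) {
            sums[k] += sample[k];            // byte is promoted to long automatically
        }
    }
    for (int k = 0; k < d; k++) {
        mean[k] = (byte) Math.round((double) sums[k] / samples.length);
    }
}

The extra storage is only d longs (about 1 KB for d = 128), regardless of the number of samples.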

Ani
Using a larger-size integer type would take more memory and, even if only held for the duration in which the means were being calculated, would exceed reasonable overheads. I fixed this problem in the end by rearranging the loop I was using to calculate the mean, thanks anyway! :-)
sinjax
A long would take four more bytes and would be rather unlikely to overflow. It would be a perfectly practical thing to use.
Tom Anderson
A: 

If you are calculating 128 means, could you not afford to allocate 128 doubles (dmean[], say) to hold them, and use

double diff = samples[i][k] - dmean[k];

dmean[k] = dmean[k] + diff / (i + 1);

to update the mean?
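In sketch form (assuming the d, samples and mean variables from the question), that might look like this; the 128 doubles cost only about a kilobyte:

// Sketch: one double running mean per dimension, cast back to bytes at the end.
double[] dmean = new double[d];
for (int i = 0; i < samples.length; i++) {
    for (int k = 0; k < d; k++) {
        double diff = samples[i][k] - dmean[k];
        dmean[k] += diff / (i + 1);          // incremental mean update, no overflow
    }
}
for (int k = 0; k < d; k++) {
    mean[k] = (byte) Math.round(dmean[k]);   // single cast down to byte per dimension
}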

dmuir
No, unfortunately for the whole sample set this would become rather expensive rather quickly. I fixed this problem by rearranging the loop I was using to calculate the mean, as I've explained above.
sinjax
A: 

Right. So I decided that I would have to hold at least a double to calculate any given dimension's mean.

The issue was that I was approaching this problem by going:

for each sample, get the array it is to update
    for each dimension in that array, calculate its running mean given the new sample

The issue with that is that a double[][] would have to be held, holding the current running mean for each dimension of each array to be updated. Therefore, I have now rearranged my loop to look more like this:

for each array to be updated
    for each sample that will update this array
        for each dimension in the array to be updated calculate the running mean

This way round requires some pre-processing: I need to loop through all the samples to find which samples will update which arrays (a single indices array), but my overall saving is that I can now hold a SINGLE double, which is updated by each sample that updates a given array for a given dimension.

This double can then be cast to the appropriate low-precision type, in my case a byte.
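In sketch form (the names means, assignments and d here are illustrative placeholders, not my actual code), with the dimension loop placed outside the sample loop so that only one double is live at a time:

// Sketch: process one target array at a time, so only a single double accumulator
// is live at any moment. assignments[s] (hypothetical) is the pre-computed index
// of the mean array that sample s updates.
for (int a = 0; a < means.length; a++) {               // each array to be updated
    for (int k = 0; k < d; k++) {                      // each dimension of that array
        double running = 0.0;
        int count = 0;
        for (int s = 0; s < samples.length; s++) {     // each sample updating this array
            if (assignments[s] != a) continue;
            count++;
            running += (samples[s][k] - running) / count;  // running mean, no overflow
        }
        means[a][k] = (byte) Math.round(running);      // cast back to the low-precision type
    }
}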

The overall saving in terms of storage space I was initially going for was:

replace ints (costing 4*128*numberOfSamples bytes) with bytes (costing 1*128*numberOfSamples bytes)

That didn't work, but I have now formulated a solution that costs something like (128*numberOfSamples + numberOfSamples). A saving of 127*numberOfSamples, which in my worst case is something approaching 15 GB of RAM :-)

So yeah, there we go: a night's sleep and I answered my own question.

Thanks for the help, fellows!

sinjax