I have a two dimensional array of bytes which looks like this:

int n = 100000;
int d = 128;
byte[][] samples = new byte[n][d];
/* proceed to fill samples with some delicious data */
byte[] mean = new byte[d];
findMean(mean,samples);

My findMean function proceeds to fill mean such that:

mean[k] = mean(samples[:][k])

Simple enough so far. The issue is that, because of overflow, this mean function cannot simply sum the values and divide. So my current attempt is to calculate a running mean, the workhorse of which looks something like this:

for (int i = 0; i < samples.length; i++) {
    byte diff = (byte) (samples[i][k] - mean[k]);
    mean[k] = (byte) (mean[k] + Math.round((double) diff / (double) (i + 1)));
}

Now this doesn't work at all; every round, the precision loss leaves the mean quite far from the correct value, which I have verified on small (therefore calculable) sets of 1000 random samples.

Also, due to the very memory constraints I'm trying to avoid by using byte arrays in the first place, it is quite impossible to allocate a large proxy float array to calculate the true mean and cast it back to bytes later.

Loading this data in chunks is... well, it's possible, but I'm considering that my final alternative, and anyway, surely that just displaces the problem to the chunk size?

Anyway: accurate calculation of the mean for an array of bytes, using a running algorithm to avoid overflow issues. Is there a good solution here?

Cheers

+2  A: 

You could use a larger-size integer type (long / BigInteger), or even arbitrary-precision arithmetic, to calculate the sum. In this case, you don't really need the online algorithm, although keeping it wouldn't have any impact other than making the calculation slower.

When you do divide the sum by the count to calculate the mean, you will of course be restricted by the precision of the floating-point type you are using, so keep that in mind. If you go down the APA route, this will not be an issue.
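A minimal sketch of that approach, assuming the findMean signature from the question and one long accumulator per dimension (with n = 100000 byte samples, each per-dimension sum stays far below what a long can hold):

// Sketch: sum each dimension into a long, then divide once at the end.
static void findMean(byte[] mean, byte[][] samples) {
    int d = mean.length;
    long[] sums = new long[d];               // one accumulator per dimension
    for (byte[] sample : samples) {
        for (int k = 0; k < d; k++) {
            sums[k] += sample[k];            // byte is promoted to long automatically
        }
    }
    for (int k = 0; k < d; k++) {
        mean[k] = (byte) Math.round((double) sums[k] / samples.length);
    }
}

The extra storage is only d longs (about 1 KB for d = 128), regardless of the number of samples.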

Ani
Using a larger-size integer type would take more memory and, even if only held for the duration in which the means were being calculated, would exceed reasonable overheads. I fixed this problem in the end by rearranging the loop I was using to calculate the mean, thanks anyway! :-)
sinjax
A long would take four more bytes and would be rather unlikely to overflow. It would be a perfectly practical thing to use.
Tom Anderson
A: 

If you are calculating 128 means, could you not afford to allocate 128 doubles (dmean[], say) to hold them, and use

double diff = samples[i][k] - dmean[k];

dmean[k] = dmean[k] + diff / (i + 1);

to update the mean?
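In sketch form (assuming the d, samples and mean variables from the question), that might look like this; the 128 doubles cost only about a kilobyte:

// Sketch: one double running mean per dimension, cast back to bytes at the end.
double[] dmean = new double[d];
for (int i = 0; i < samples.length; i++) {
    for (int k = 0; k < d; k++) {
        double diff = samples[i][k] - dmean[k];
        dmean[k] += diff / (i + 1);          // incremental mean update, no overflow
    }
}
for (int k = 0; k < d; k++) {
    mean[k] = (byte) Math.round(dmean[k]);   // single cast down to byte per dimension
}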

dmuir
No, unfortunately for the whole sample set this would become rather expensive rather quickly. I fixed this problem by rearranging the loop I was using to calculate the mean, as I've explained above.
sinjax
A: 

Right. So I decided that I would have to hold at least a double to calculate any given dimension's mean.

The issue was that I was approaching this problem by going:

for each sample, get the array it is to update
    for each dimension in that array, calculate its running mean given the new sample

The issue with that is that a double[][] would have to be held, holding the current running mean for each dimension of each array to be updated. Therefore, I have now rearranged my loop to look more like this:

for each array to be updated
    for each sample that will update this array
        for each dimension in the array to be updated calculate the running mean

This way round requires some pre-processing: I need to loop through all the samples to find which samples will update which arrays (a single indices array), but my overall saving is that I can now hold a SINGLE double, which is updated by each sample that updates a given array for a given dimension.

This double can then be cast to the appropriate low-precision type, in my case a byte.
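In sketch form (the names means, assignments and d here are illustrative placeholders, not my actual code), with the dimension loop placed outside the sample loop so that only one double is live at a time:

// Sketch: process one target array at a time, so only a single double accumulator
// is live at any moment. assignments[s] (hypothetical) is the pre-computed index
// of the mean array that sample s updates.
for (int a = 0; a < means.length; a++) {               // each array to be updated
    for (int k = 0; k < d; k++) {                      // each dimension of that array
        double running = 0.0;
        int count = 0;
        for (int s = 0; s < samples.length; s++) {     // each sample updating this array
            if (assignments[s] != a) continue;
            count++;
            running += (samples[s][k] - running) / count;  // running mean, no overflow
        }
        means[a][k] = (byte) Math.round(running);      // cast back to the low-precision type
    }
}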

The overall saving in terms of storage space I was initially going for was:

replace ints (costing 4*128*numberOfSamples bytes) with bytes (costing 1*128*numberOfSamples bytes)

That didn't work, but I have now formulated a solution that costs something like (128*numberOfSamples + numberOfSamples). A saving of 127*numberOfSamples, which in my worst case is something approaching 15 GB of RAM :-)

So yeah, there we go: a night's sleep and I answered my own question.

Thanks for the help, fellows!

sinjax