Hi,

I have an apparently "simple" problem, but for some reason I can't find the solution.
I have n million files of different sizes and I want to find the average file size.
To simplify, I grouped them into buckets in multiples of 16 KB:

< 16 KB = 18689546 files
< 32 KB = 1365713 files
< 48 KB = 1168186 files
...

Of course, the simple formula (total_size / number of files) does not work: it gives an average of 291 KB.
What algorithm would calculate the real average?

Thx, JD

+1  A: 

You might be running into an overflow when summing the file sizes (the total size probably doesn't fit into a 32-bit value). The easiest fix might be to use a 64-bit int for the variable that holds the sum.
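
To illustrate the overflow concern, here is a rough Python sketch. Python integers never overflow on their own, so the 32-bit accumulator is simulated with a mask; the function name `sum_wrapped_32` and the three 2 GiB file sizes are made up for the example, not taken from the question.

```python
# Sketch only: simulate what an unsigned 32-bit accumulator would hold
# after summing file sizes, versus the exact total.

def sum_wrapped_32(sizes):
    """Sum sizes the way an unsigned 32-bit variable would (wraps modulo 2**32)."""
    total = 0
    for size in sizes:
        total = (total + size) & 0xFFFFFFFF  # keep only the low 32 bits
    return total

GIB = 1024 ** 3
sizes = [2 * GIB, 2 * GIB, 2 * GIB]  # hypothetical: three files of 2 GiB each

wrapped = sum_wrapped_32(sizes)  # what a 32-bit sum would report
exact = sum(sizes)               # Python ints are arbitrary precision

print("32-bit sum:", wrapped)
print("exact sum: ", exact)
```

If the two numbers differ, the accumulator was too small for the data, and any average computed from it would be off as well.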

Michael Burr
I do use 64 bits. The problem is that (total_size / number of files) cannot work. For example, with this formula, 10 files of 1 KB and 1 file of 1 MB would give an average of 94 KB... which is of course wrong.
JD
Are you maybe looking for something other than the average? 94 KB is the correct average of 11 files that total 1.01 MB. What number would you expect to get for these files?
Adam Ruth
"which is of course wrong" - I guess you need to specify a little more clearly what average you're looking for. The mean (which is commonly referred to as 'the average') size of ten 1 KB files and one 1 MB file is 94 KB, so if you're not looking for the mean you should make it clear what you *are* looking for.
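
For anyone who wants to check the arithmetic, a quick sketch in Python, assuming 1 MB = 1024 KB as in the comments here:

```python
# Mean size of ten 1 KB files and one 1 MB file, in KB.
sizes_kb = [1] * 10 + [1024]  # assumes 1 MB = 1024 KB

total_kb = sum(sizes_kb)            # 10 * 1 + 1024 = 1034
mean_kb = total_kb / len(sizes_kb)  # 1034 / 11

print(total_kb, "KB in", len(sizes_kb), "files, mean =", mean_kb, "KB")
```

So total_size / number of files really does give 94 KB for this data set; the formula is doing exactly what a mean is supposed to do.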
Michael Burr
Hmm... I would have expected something closer to 1 KB, since 10 files out of 11 are 1 KB... Would that be called something like a weighted average?
JD
Well, 94 KB is a lot closer to 1 KB than it is to 1024 KB. 90% closer, actually. Maybe you want the median, which in this case would be 1 KB. http://en.wikipedia.org/wiki/Median
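
A small sketch comparing the two on the same 11 files, using Python's standard `statistics` module (again assuming 1 MB = 1024 KB):

```python
import statistics

sizes_kb = [1] * 10 + [1024]  # ten 1 KB files and one 1 MB file

mean_kb = statistics.mean(sizes_kb)      # pulled up by the single large file
median_kb = statistics.median(sizes_kb)  # the middle value once sorted

print("mean:  ", mean_kb, "KB")
print("median:", median_kb, "KB")
```

The median ignores how large the biggest file is; it only cares that it is the largest, which is why it stays at 1 KB here.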
Adam Ruth
You might be looking for the median (http://en.wikipedia.org/wiki/Median), which is often used instead of the mean for data sets that might not have a normal distribution. But you probably need someone who knows more about statistics than I do to get a really good answer.
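
If only the 16 KB bucket counts from the question are available, the median can still be located to within one bucket. The following is a rough sketch, not a definitive method: `median_bucket` is a made-up helper name, and the counts in the usage example are hypothetical because the question elides most of its buckets.

```python
# Sketch: find which size bucket contains the median file, given only
# per-bucket counts (as in the "< 16 KB = N files" listing).

def median_bucket(buckets):
    """buckets: list of (upper_bound_kb, file_count), sorted by upper bound.

    Returns the upper bound (in KB) of the bucket holding the median file.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("no files in any bucket")
    target_rank = (total + 1) // 2  # 1-based rank of the (lower) median file
    running = 0
    for upper_kb, count in buckets:
        running += count  # cumulative number of files up to this bucket
        if running >= target_rank:
            return upper_kb
    raise AssertionError("unreachable: cumulative count always reaches total")

# Hypothetical counts, for illustration only.
example = [(16, 70), (32, 20), (48, 10)]
result = median_bucket(example)
print("median file is in the bucket below", result, "KB")
```

This only says which bucket the median falls in, not its exact value; getting closer than that would need either smaller buckets or the raw sizes.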
Michael Burr