I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good samples of how to calculate mean and variance using hadoop, and/or provide some sample code.
Thanks