I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it so that it can be computed over partitions of the data, which opens the door to parallel processing. I have been looking at Hadoop and Pig, and I figured that a good practical place to start was to compute basic statistics on my data, i.e. the arithmetic mean and variance.

I've been googling for a while, but maybe I'm not using the right keywords; I haven't really found a good primer for doing this sort of calculation, so I thought I would ask here.

Can anyone point me to some good examples of how to calculate mean and variance using Hadoop, and/or provide some sample code?

Thanks

A: 

Pig Latin has an associated library of reusable code called PiggyBank that contains numerous handy functions. Unfortunately, it didn't include variance last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.

I should note that variance is difficult to compute in a numerically stable way over huge data sets: the naive one-pass formula (mean of squares minus square of the mean) suffers catastrophic cancellation when the variance is small relative to the mean. So take care!
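The standard workaround is to keep Welford-style running statistics per partition and merge them with Chan et al.'s pairwise formula. Here is a minimal sketch in plain Java; the class and its names are illustrative (not from PiggyBank or the Hadoop API), but the update and merge steps are the standard ones:

    // A sketch of numerically stable, partition-friendly variance.
    // Each partition keeps (count, mean, M2); summaries merge exactly.
    class RunningStats {
        long count;
        double mean;
        double m2;  // sum of squared deviations from the current mean

        // Welford's one-pass update for a single observation.
        void add(double x) {
            count++;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        }

        // Chan et al.'s pairwise merge of two partition summaries;
        // this is what a combiner or reducer would apply.
        static RunningStats merge(RunningStats a, RunningStats b) {
            RunningStats out = new RunningStats();
            out.count = a.count + b.count;
            if (out.count == 0) return out;
            double delta = b.mean - a.mean;
            out.mean = a.mean + delta * b.count / out.count;
            out.m2 = a.m2 + b.m2
                   + delta * delta * a.count * b.count / out.count;
            return out;
        }

        double sampleVariance() {
            return count > 1 ? m2 / (count - 1) : 0.0;
        }
    }

In a MapReduce job, each mapper would fold its records into one summary with add(), and a reducer would combine the per-split summaries with merge(); since merge() is associative, it also works as a combiner.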

Marcelo Cantos
Here is a link to the PiggyBank UDFs for statistics (correlation, covariance): http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/stats/
Ro
As a stopgap, the identity cov(x, x) = var(x) (the covariance of a variable with itself, E[(x − μ)²], is exactly its variance) provides a quick way to compute variance in Pig. There are two open JIRAs working on a stable variance function; hopefully it will land in Pig 0.8.
Jakob Homan
A: 

You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.

http://www.cascading.org/

And if you are into Clojure, you might watch these GitHub projects: http://github.com/clj-sys

They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).

cwensel