In the latest Hadoop Studio the 0.18 API of Hadoop is called "Stable" and the 0.20 API of Hadoop is called "Unstable".

The distribution that comes from Yahoo is a 0.20 (with Yahoo patches), which is apparently "the way to go". Cloudera states that its 0.20 (with Cloudera patches) is also stable.

Given that we'll start coding a new Hadoop project in the next few weeks, which API should we use, and which Hadoop distribution (Apache, Cloudera, Yahoo, ...)?

Thanks for your insights.

+3  A: 

Version 0.20 is currently the best balance of stability and features; 0.18 is getting pretty long in the tooth. As for the Y!/Cloudera/Apache distributions, that choice comes into play when deploying a cluster, not necessarily when writing the application. I'd recommend the Y! distribution: it's what Yahoo!, which runs more than 25,000 Hadoop nodes, uses internally, so it is very well tested and reliable.

Jakob Homan
+1  A: 

I personally would argue for staying with the 0.18 APIs. The 0.20 APIs are really considered 'evolving'; in 0.20 they are incomplete, and they will change again in 0.21.

That said, you should strongly consider not coding to the Hadoop APIs at all, and instead use Pig/Hive/Cascading as your primary interfaces. Only Cascading is actually an alternative API; Pig and Hive are syntaxes. The point is that these projects will shield you from many of the changes as the Hadoop APIs evolve, and all of them already encapsulate best practices for tuning and configuration.

Also keep in mind that the APIs are all or nothing: you can't mix calls from the old and new APIs in the same app.
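As a rough illustration of how differently the two styles are shaped, here is a minimal sketch. It does not use Hadoop itself: `OutputCollector`, `OldMapper`, `Context` and `NewMapper` are simplified local stand-ins (assumptions made so the snippet compiles without Hadoop on the classpath) for the types in `org.apache.hadoop.mapred` and `org.apache.hadoop.mapreduce`. The real old-style `map` also takes a `Reporter`, and in the real new API `Context` is nested inside `Mapper`; both details are left out here.

```java
import java.util.ArrayList;
import java.util.List;

class ApiShapes {
    // Stand-ins (assumptions) for the old style in org.apache.hadoop.mapred:
    // the mapper is an interface and emits through an OutputCollector.
    interface OutputCollector<K, V> { void collect(K key, V value); }
    interface OldMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, OutputCollector<K2, V2> output);
    }

    // Stand-ins (assumptions) for the new style in org.apache.hadoop.mapreduce:
    // the mapper is a class you extend and emits through a Context.
    interface Context<K, V> { void write(K key, V value); }
    static abstract class NewMapper<K1, V1, K2, V2> {
        abstract void map(K1 key, V1 value, Context<K2, V2> context);
    }

    // The same tokenizing logic written against each shape.
    static class OldTokenMapper implements OldMapper<Long, String, String, Integer> {
        public void map(Long offset, String line, OutputCollector<String, Integer> output) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) output.collect(token, 1);
            }
        }
    }

    static class NewTokenMapper extends NewMapper<Long, String, String, Integer> {
        void map(Long offset, String line, Context<String, Integer> context) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) context.write(token, 1);
            }
        }
    }

    // Tiny drivers that collect the emitted pairs as "key=value" strings.
    static List<String> runOld(String line) {
        List<String> out = new ArrayList<>();
        new OldTokenMapper().map(0L, line, (k, v) -> out.add(k + "=" + v));
        return out;
    }

    static List<String> runNew(String line) {
        List<String> out = new ArrayList<>();
        new NewTokenMapper().map(0L, line, (k, v) -> out.add(k + "=" + v));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runOld("to be or not to be"));
        System.out.println(runNew("to be or not to be"));
    }
}
```

The logic is identical, but an `OldTokenMapper` is not a `NewMapper` and vice versa, so a driver written for one style cannot simply be handed a class written for the other.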

cwensel