Can somebody outline the differences between the various Hadoop distributions available, using the Apache Hadoop distro as a baseline?

Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?

+7  A: 

Disclaimer: I interned at Cloudera this summer (but some of my best friends are at Yahoo! :-))

The Yahoo distribution is a version of Hadoop 0.20 that they run (ran?) on some subset of their clusters. It includes a set of patches for stability, bug fixes, etc. It is a source release; it does not have admin-friendly features like RPM or Debian packages.

The Cloudera distribution is packaged as rpms and debs (the source is also available). This means you can get updates via standard methods, etc. It also includes stability and bug fix patches. It is constantly maintained (not to say Yahoo's isn't -- I suppose one could just go on GitHub and check when they last updated it). It also packages Pig and Hive.

Cloudera's distribution of Hadoop 0.20 is in beta, and 0.18 is considered stable (more on this on the Cloudera blog). The 0.18 version also includes packages for Hive and Pig; for 0.20, you have to build them yourself (there aren't official releases of Pig or Hive that support 0.20 yet, although patches exist). There may well be significant overlap between the Cloudera and Yahoo versions of 0.20; both provide manifests, so you can check. The latest documentation for Cloudera's distros is at http://archive.cloudera.com
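If you want to check that overlap yourself, something like the following rough sketch might help. It assumes each manifest is plain text that mentions Apache JIRA keys (e.g. HADOOP-1234) somewhere on each line; I haven't verified the actual manifest formats, so treat the regex as an assumption. The function names and the sample keys are made up for illustration.

```python
import re

# Assumption: patches are identified by Apache JIRA keys such as HADOOP-1234.
JIRA_KEY = re.compile(r"\b(?:HADOOP|HDFS|MAPREDUCE)-\d+\b")


def jira_keys(manifest_text):
    """Return the set of JIRA keys mentioned anywhere in a manifest's text."""
    return set(JIRA_KEY.findall(manifest_text))


def compare_manifests(text_a, text_b):
    """Split keys into those shared by both manifests and those unique to each."""
    a, b = jira_keys(text_a), jira_keys(text_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Made-up sample manifests; these are not real patch lists.
sample_a = """
HADOOP-1111 example fix one
HDFS-2222 example fix two
"""
sample_b = """
HDFS-2222: example fix two
MAPREDUCE-3333: example feature
"""

result = compare_manifests(sample_a, sample_b)
print(result)
```

To use it on real manifests, read each file into a string and pass the two strings to compare_manifests; adjust the regex if the manifests identify patches some other way.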

Yahoo does not provide support for their distribution; they provide their patched version as a service to the community, so the folks who are interested can build what Yahoo runs internally. Given the size of Yahoo clusters, that's a significant contribution, especially if you aren't a Hadoop developer who follows the JIRAs all the time. Cloudera supports their distribution commercially, as well as providing some community support via the Hadoop mailing lists and, for distro-specific issues, on their GetSatisfaction page.

Both are pretty different from the vanilla Apache distro since they patch it in between releases (the Cloudera version of 0.20 has 60+ patches!).

SquareCog
Awesome, thanks for the insight into both distros...
Jon
A: 

SquareCog is right on almost all points except: The Yahoo! distribution is what is run on all the production clusters at Yahoo!, not a subset of them. This is more than 25,000 machines in total. The Yahoo! distribution has had the extensive, end-to-end testing necessary to ensure reliable, consistent operation. The other distribution is more liberal about applying patches and so may have more features, but has not been tested as extensively.

Jakob Homan
A: 

Beginning with version 3, Cloudera's Distribution for Hadoop includes not just HDFS and MapReduce, but also ZooKeeper, Hive, Pig, HBase, Oozie, Flume, Sqoop, Hue, and Whirr. All of these components are tested together and deployed on clusters ranging from a single node to several thousand nodes. For more details, see our blog posts for the CDH3b2 release and the CDH3b3 release.

Each component of the distribution begins with an Apache-licensed release. On top of this release we add patches that we have found useful in customer engagements. These patches may backport bug fixes, performance improvements, or in rare cases, critical new features. They are rigorously tested and made available in packages for most Linux platforms, as well as source tarballs.

It's important to note that CDH is 100% Apache-licensed source code that you can download and build for yourself, if you'd like.

You won't find a version of Hadoop that's more feature complete or stable across multiple platforms. If you end up using Hadoop in a production environment, it's worth noting that you'll be able to obtain professional support and management tools with Cloudera Enterprise.

Jeff Hammerbacher