tags:
views: 240
answers: 3

Hi,

I want to learn Hadoop, but I don't have access to a cluster right now. Is it still possible for me to learn it properly and use it to write programs?

Would it be helpful to run multiple Linux VMs and use them as boxes to run Hadoop? Or do you think that is more of a stretch, and that running it on multiple hosts is the same as running it on a single host (in terms of setup, the Hadoop API used, the architecture of the MapReduce programs, etc.)?

Thanks,

A: 

HBase is an open-source implementation of Hadoop; there is a pretty decent getting-started guide at the link I supplied. At some point you will want to load-balance and fail over, so multiple servers will be required, but for just getting started you don't need to do that.

(We use an Amazon EC2 machine instance for our testing; firing up new machine instances is trivial.)

HBase has a pretty easy REST-based API. Have fun.
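As a rough illustration, here is a minimal sketch of hitting the REST gateway from plain Java. It assumes the HBase REST service (Stargate) is already running locally on its historical default port of 8080; the host, port, and class name are just placeholders for the example, so adjust them to your setup.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical example class; assumes the HBase REST gateway (Stargate)
    // is running, e.g. on localhost:8080 (the historical default port).
    public class HBaseRestCheck {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // Print whatever the gateway returns (its version information).
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
        conn.disconnect();
      }
    }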

Tim Jarvis
Isn't HBase a distributed database that runs on top of HDFS? It's a subproject of Hadoop.
Binary Nerd
A: 

I would suggest a VM for learning purposes. Good ones are Cloudera's VM and the OpenSolaris Live Hadoop project:

Cloudera VM: http://www.cloudera.com/developers/downloads/virtual-machine/

OpenSolaris Live Hadoop: http://hub.opensolaris.org/bin/view/Project+livehadoop/

Hadoop wiki: http://wiki.apache.org/hadoop/

Getting started: http://wiki.apache.org/hadoop/QuickStart

Hadoop tutorial videos: http://www.cloudera.com/resources/?type=Training

Tutorial series: http://philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/

Hope it helps!

Sundar
Additional link: http://code.google.com/edu/parallel/index.html#hadoop
Sundar
+2  A: 

If you are just interested in getting to grips with the basics of Hadoop, i.e. how to access HDFS, running basic MapReduce jobs, etc., then you can really do without a cluster or even multiple VMs.

Hadoop is able to run in three modes:

  1. Fully-distributed
  2. Pseudo-distributed
  3. Non-distributed (Local)

For learning purposes you can start with non-distributed mode, which runs on a single machine. Everything runs inside a single JVM and none of the Hadoop daemons run. This is the simplest mode to get going, but it still lets you write and run real MapReduce jobs. You can have it up and running in a few minutes once you have downloaded the latest package.
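To give a concrete picture, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API (roughly the 0.20-era "new" API; constructor and class details vary slightly between releases). The point is that the same code runs unchanged whether you are in non-distributed, pseudo-distributed, or fully-distributed mode:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in each input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // Job.getInstance(conf) in newer releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

In non-distributed mode you can run this with something like bin/hadoop jar wordcount.jar WordCount input/ output/, where input/ and output/ are ordinary directories on your local filesystem.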

Pseudo-distributed mode is the next level up from non-distributed mode. It still runs on a single machine, but it simulates the operation of a cluster more accurately: the Hadoop daemons run in this mode, each in its own JVM, simulating the nodes of a cluster.
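As a rough sketch, Hadoop's own single-node setup guide (the QuickStart linked in another answer here) configures pseudo-distributed mode with properties along these lines. The names are from the 0.20-era releases that guide covers; newer versions use fs.defaultFS and YARN-specific settings instead, so check the documentation for your version:

    <!-- conf/core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- conf/hdfs-site.xml : single node, so keep one replica per block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

    <!-- conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

After that you format the namenode (bin/hadoop namenode -format) and start the daemons (bin/start-all.sh in those releases), and the same jobs you ran in non-distributed mode now run against a local HDFS instead of the plain filesystem.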

Fully-distributed is the mode a full-blown cluster uses.

Binary Nerd