Hi, I am planning to use Hadoop on EC2. Since we pay per instance-hour, it is wasteful to keep more instances running than the jobs actually require.

In our application, many jobs run concurrently, and we do not know the number of slaves required ahead of time. Is it possible to start the Hadoop cluster with a minimum number of slaves and then grow or shrink it based on demand?

i.e., create/destroy slaves on demand.

Sub-question: can a Hadoop cluster manage multiple jobs concurrently?

Thanks

+1  A: 

The default scheduler in Hadoop is a simple FIFO one. You can look into the FairScheduler instead, which assigns each running job a share of the cluster and has extensive configuration to control those shares.
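
For example, on a 0.20-era cluster, switching to the FairScheduler is a matter of a couple of properties in mapred-site.xml, assuming the FairScheduler contrib jar is on the JobTracker's classpath (the allocation file path below is just an illustration):

    <!-- mapred-site.xml: swap the FIFO scheduler for the FairScheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- optional: per-pool shares are defined in this allocation file -->
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    </property>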

As far as EC2 is concerned, you can easily start off with some number of nodes, and then, once you see that there are too many tasks in the queue and all the slots in the cluster are occupied, add more. You simply have to start up an instance and launch a TaskTracker on it, which will register itself with the JobTracker.
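
Roughly, adding a node by hand looks like the sketch below. The AMI ID and keypair are placeholders; it assumes an image with Hadoop installed and a mapred-site.xml whose mapred.job.tracker already points at your master:

    # launch a new worker instance (AMI ID, keypair, and type are placeholders)
    ec2-run-instances ami-xxxxxxxx -k my-keypair -t m1.large

    # then, on the new instance: start a TaskTracker (and a DataNode, if the
    # node should also serve HDFS); it locates the JobTracker via the
    # mapred.job.tracker setting baked into the image
    bin/hadoop-daemon.sh start tasktracker
    bin/hadoop-daemon.sh start datanode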

However, you will have to build your own system to manage the startup and shutdown of these nodes.
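
One naive version of such a system (a sketch only; the threshold, AMI ID, and output parsing are assumptions) is a cron job on the master that grows the cluster when the job queue backs up:

    #!/bin/sh
    # naive autoscaling sketch: run periodically from cron on the master.
    # counts jobs currently known to the JobTracker and launches one more
    # worker when more than MAX_JOBS are in flight. Shrinking is left out;
    # note that killing a node mid-job loses its running tasks.
    MAX_JOBS=4
    RUNNING=$(bin/hadoop job -list | grep -c '^job_')
    if [ "$RUNNING" -gt "$MAX_JOBS" ]; then
        ec2-run-instances ami-xxxxxxxx -k my-keypair
    fi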

Dmytro Molkov
A: 

This seems promising: Hadoop On Demand (HOD), http://hadoop.apache.org/common/docs/r0.17.1/hod.html
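
For reference, HOD provisions a private Hadoop cluster on nodes managed by a Torque resource manager and tears it down when you are done, which matches the allocate/deallocate pattern asked about. The usual flow looks like this (cluster directory and node count are illustrative):

    # allocate a 4-node cluster; state is kept in the cluster directory
    hod allocate -d ~/hod-clusters/test -n 4
    # run jobs against it, then release the nodes
    hod deallocate -d ~/hod-clusters/test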

Nayn