Background

My employer is progressively shifting our resource-intensive ETL and backend processing logic from MySQL to Hadoop (HDFS & Hive). At the moment everything is still fairly small and manageable (20 TB over 10 nodes), but we intend to grow the cluster steadily.

Now that Hadoop is moving into production use, batch scheduling and sharing the cluster between ad-hoc user Hive queries, hourly M/R jobs, and, I expect, eventually some HBase usage is becoming a bigger issue. The fear is that a user will submit a naive query that runs for an unreasonable amount of time (say 4 hours), clogging the task queue and destabilizing the rest of the infrastructure.

Question

Another part of my company has already been burned by Flume's immaturity, so my question is: how stable are the two known schedulers (Capacity & Fair), and besides usage at their sponsoring companies (Yahoo & Facebook), are they used elsewhere?

Edit: Background info

http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/

http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html

http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacity_scheduler.html
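
For reference, both schedulers are wired in the same way under classic MRv1: you point the JobTracker at a scheduler class in mapred-site.xml. A minimal sketch, assuming the property names from the 0.21 docs linked above; the allocation file path is just an example, and you would swap the class for org.apache.hadoop.mapred.CapacityTaskScheduler to use the Capacity Scheduler instead:

  <!-- mapred-site.xml: hand task scheduling to the Fair Scheduler -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <!-- pool definitions live in a separate allocation file (example path) -->
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
  </property>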

Answer (+1):

We ship CDH with the Fair Share scheduler on by default. It's quite stable.

Jeff Hammerbacher
@Jeff took me a minute to figure out who you are :)
David
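
To make the runaway ad-hoc query concern above concrete: the Fair Scheduler lets you split jobs into pools and give the ad-hoc Hive pool only a bounded share of the cluster, so a 4-hour query slows its own pool rather than the hourly jobs. A minimal fair-scheduler.xml sketch; the pool names and numbers here are illustrative, not recommendations:

  <?xml version="1.0"?>
  <allocations>
    <!-- hourly production M/R jobs: guaranteed slots plus extra weight -->
    <pool name="production">
      <minMaps>20</minMaps>
      <minReduces>10</minReduces>
      <weight>2.0</weight>
    </pool>
    <!-- ad-hoc Hive users: fair sharing keeps this pool to its share of slots,
         and maxRunningJobs caps how many queries run at once -->
    <pool name="adhoc">
      <maxRunningJobs>5</maxRunningJobs>
      <weight>1.0</weight>
    </pool>
    <!-- per-user job cap for anything not mapped to a named pool -->
    <userMaxJobsDefault>3</userMaxJobsDefault>
  </allocations>

Note that these settings bound concurrency and slot share, not wall-clock time, so a genuinely runaway query still has to be preempted or killed.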