My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multiserver Hadoop cluster, but we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans), I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Because the tool is very CPU intensive, this "single threadedness" is my current bottleneck.

When I run it in the NetBeans profiler, I do see that the app starts several threads for various purposes, but only a single map or reduce task is running at any given moment.

The input data consists of several input files so Hadoop should at least be able to run 1 thread per input file at the same time for the map phase.

What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?

I'm expecting this to be something very silly that I've overlooked.


I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. It implements the feature I was looking for in Hadoop 0.21 and introduces the flag mapreduce.local.map.tasks.maximum to control it.

For now I've also found the solution described here in this question.

A: 

According to this thread on the hadoop.core-user email list, you'll want to change the mapred.tasktracker.tasks.maximum setting to the max number of tasks you would like your machine to handle (which would be the number of cores).

This (and other properties you may want to configure) is also documented in the main documentation on how to set up your cluster/daemons.

matt b
There's no option named `mapred.tasktracker.tasks.maximum`; there are separate options for map and reduce: `mapred.tasktracker.{map|reduce}.tasks.maximum`. They are documented under the second link you posted.
Wojtek
My interpretation of that was that you could have `map` or `reduce` or none. The email thread is from 2007, but the author of Hadoop mentioned using `mapred.tasktracker.tasks.maximum`.
matt b
Well, this email is from 2007, so it most likely concerns a version of Hadoop before 0.16: separate options for mappers and reducers were introduced in 0.16 (and 0.16 was released somewhere around 2008). Compare http://hadoop.apache.org/common/docs/r0.15.2/cluster_setup.html#Configuring+the+Hadoop+Daemons and http://hadoop.apache.org/common/docs/r0.16.0/cluster_setup.html#Configuring+the+Hadoop+Daemons
Wojtek
I just found that mapred.tasktracker.tasks.maximum was deprecated in Hadoop 0.16 ( https://issues.apache.org/jira/browse/HADOOP-1274 ) and is now mapred.tasktracker.{map|reduce}.tasks.maximum.
Niels Basjes
+2  A: 

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.

Anyway, to set the maximum number of running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default those options are set to 2, so I might be right.

Finally, if you want to be prepared for a multinode cluster, go straight to running this in a fully distributed way, but have all servers (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.
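
As a rough illustration of the two options named above, here is a small plain-Java sketch that prints them in the `<property>` layout Hadoop configuration files use. The class name `SlotConfigSketch`, its methods, and the choice of one map slot per core plus half as many reduce slots are assumptions made for this example. Which file the TaskTracker actually reads these from depends on your Hadoop version, so check the cluster setup documentation for your release.

```java
final class SlotConfigSketch {
    // Render one Hadoop-style <property> element (hypothetical helper).
    static String property(String name, int value) {
        return "  <property>\n"
             + "    <name>" + name + "</name>\n"
             + "    <value>" + value + "</value>\n"
             + "  </property>\n";
    }

    // Build a <configuration> body holding the two per-TaskTracker limits.
    static String siteXml(int mapSlots, int reduceSlots) {
        return "<configuration>\n"
             + property("mapred.tasktracker.map.tasks.maximum", mapSlots)
             + property("mapred.tasktracker.reduce.tasks.maximum", reduceSlots)
             + "</configuration>\n";
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumption: one map slot per core, half as many reduce slots.
        System.out.print(siteXml(cores, Math.max(1, cores / 2)));
    }
}
```

These limits are read by the tasktracker daemon, so they only matter once the job runs against a (pseudo-)distributed setup rather than in local mode.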

Wojtek
Thanks. Because of your observation I downloaded the source and dug through it. I found that when running in local mode, org.apache.hadoop.mapred.LocalJobRunner is used to actually run the job. Its run() method simply does everything sequentially, with no threading at all. I did find org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper. A very strange feature: a mapper implementation that does threading OUTSIDE of the actual Hadoop framework. According to the documentation it is only useful if you are not CPU bound. Our tool is CPU bound, so we can't use this.
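
For contrast with that sequential loop, this is roughly what "one map task per input file, several at a time" looks like in plain Java. It is only a sketch: it is not Hadoop code and not how LocalJobRunner or MAPREDUCE-1367 is implemented, and the names `ParallelMapSketch` and `mapAll` are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelMapSketch {
    // Runs mapTask once per input on at most `threads` worker threads
    // and returns the results in input order.
    static <I, O> List<O> mapAll(List<I> inputs, Function<I, O> mapTask, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> mapTask.apply(input);
                pending.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                try {
                    results.add(future.get()); // blocks until this task finishes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("map task failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical input names; String::length stands in for CPU-heavy work.
        List<String> files = Arrays.asList("part-a", "part-b", "part-c", "part-d");
        System.out.println(mapAll(files, String::length, 4));
    }
}
```

With a CPU-bound task, the thread count would normally be capped at the number of cores.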
Niels Basjes