My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multi-server Hadoop cluster, but we chose to start this project "being prepared".
When I run this app on the command line (or in Eclipse or NetBeans), I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Because the tool is very CPU intensive, this "single threadedness" is my current bottleneck.
When running it in the NetBeans profiler, I do see that the app starts several threads for various purposes, but only a single map/reduce task is running at any given moment.
The input data consists of several input files so Hadoop should at least be able to run 1 thread per input file at the same time for the map phase.
What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?
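To make clear what kind of behaviour I am after, here is a minimal plain-Java sketch, not Hadoop code: one task per input file, with at most N of them running at once on a fixed-size thread pool. The class name `ParallelMapSketch` and the `processFile` method are hypothetical stand-ins for the real map work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java illustration only; none of this is Hadoop API.
final class ParallelMapSketch {

    // Hypothetical stand-in for a CPU-heavy map task over one input file.
    static int processFile(String fileName) {
        int checksum = 0;
        for (int i = 0; i < fileName.length(); i++) {
            checksum = 31 * checksum + fileName.charAt(i);
        }
        return checksum;
    }

    // Runs one task per input file, with at most maxThreads running at once.
    static List<Integer> processAll(List<String> files, int maxThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String f : files) {
                tasks.add(() -> processFile(f));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and
            // returns the futures in the same order as the tasks.
            for (Future<Integer> done : pool.invokeAll(tasks)) {
                results.add(done.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> files = Arrays.asList("part-a.txt", "part-b.txt", "part-c.txt");
        System.out.println(processAll(files, 4).size() + " files processed");
    }
}
```

What I want is for Hadoop to schedule my map tasks in this way when it runs on a single machine, without me having to write the threading myself.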
I'm expecting this to be something very silly that I've overlooked.
I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. It implements the feature I was looking for in Hadoop 0.21, and it introduces the flag mapreduce.local.map.tasks.maximum to control it.
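A minimal sketch of how I would expect to set that flag from the driver code, assuming Hadoop 0.21 or later on the classpath and a job that is built from this Configuration. The class name `LocalParallelMaps` and the value 4 are just examples of mine. Going by its name, the flag covers map tasks only.

```java
import org.apache.hadoop.conf.Configuration;

// Configuration fragment; needs the Hadoop libraries to compile.
public class LocalParallelMaps {

    // Returns a Configuration that asks the local job runner for up to
    // maxMaps concurrent map tasks (flag name taken from MAPREDUCE-1367).
    public static Configuration withLocalMaps(int maxMaps) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.local.map.tasks.maximum", maxMaps);
        return conf;
    }
}
```

The job would then be created from `LocalParallelMaps.withLocalMaps(4)` instead of a bare `new Configuration()`.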
For now I've also found the solution described here in this question.