Hi,

I am doing some text processing using Hadoop MapReduce jobs. My job is 99.2% complete and appears to be stuck on the last map task.

The last few lines of the map task log are shown below. The last time this problem occurred, I printed the keys and values emitted from the map and noticed that one key had a very large number of values associated with it. I think the job appeared stuck because it was sorting those values. When I stopped emitting that key from the map, the job ran fine.

I think the same problem has occurred again, but printing out the key-value pairs is tedious because the job takes a long time to run. Is there a better alternative, such as configuring Hadoop to drop keys that take too long to sort? Does anything like this exist?

2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79698262; bufvoid = 99614720
2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 6601; length = 327680
2010-10-20 14:43:33,272 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79698262; bufend = 59800449; bufvoid = 99614720
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: kvstart = 6601; kvend = 9039; length = 327680
2010-10-20 14:50:44,864 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59800449; bufend = 39893455; bufvoid = 99614720
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: kvstart = 9039; kvend = 11228; length = 327680
2010-10-20 14:58:33,817 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: bufstart = 39893455; bufend = 20000988; bufvoid = 99614720
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: kvstart = 11228; kvend = 13286; length = 327680
2010-10-20 15:06:49,395 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: bufstart = 20000988; bufend = 78879; bufvoid = 99614720
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: kvstart = 13286; kvend = 15265; length = 327680
2010-10-20 15:15:24,230 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: bufstart = 78879; bufend = 79807573; bufvoid = 99614720
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: kvstart = 15265; kvend = 17188; length = 327680
2010-10-20 15:24:36,500 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79807573; bufend = 59907680; bufvoid = 99614720
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: kvstart = 17188; kvend = 19074; length = 327680
2010-10-20 15:33:34,114 INFO org.apache.hadoop.mapred.MapTask: Finished spill 6
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59907680; bufend = 40011208; bufvoid = 99614720
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: kvstart = 19074; kvend = 20926; length = 327680
2010-10-20 15:42:40,597 INFO org.apache.hadoop.mapred.MapTask: Finished spill 7
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: bufstart = 40011208; bufend = 20111383; bufvoid = 99614720
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: kvstart = 20926; kvend = 22759; length = 327680
2010-10-20 15:51:50,378 INFO org.apache.hadoop.mapred.MapTask: Finished spill 8
2010-10-20 16:01:05,893 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:01:05,893 INFO org.apache.hadoop.mapred.MapTask: bufstart = 20111383; bufend = 196929; bufvoid = 99614720
2010-10-20 16:01:05,894 INFO org.apache.hadoop.mapred.MapTask: kvstart = 22759; kvend = 24572; length = 327680
2010-10-20 16:01:06,634 INFO org.apache.hadoop.mapred.MapTask: Finished spill 9
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: bufstart = 196929; bufend = 79900267; bufvoid = 99614720
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: kvstart = 24572; kvend = 26370; length = 327680
2010-10-20 16:10:25,776 INFO org.apache.hadoop.mapred.MapTask: Finished spill 10
2010-10-20 16:19:48,283 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:19:48,283 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79900267; bufend = 59993676; bufvoid = 99614720
2010-10-20 16:19:48,284 INFO org.apache.hadoop.mapred.MapTask: kvstart = 26370; kvend = 28152; length = 327680
2010-10-20 16:19:49,042 INFO org.apache.hadoop.mapred.MapTask: Finished spill 11

Thank you

+1  A: 

There is nothing in Hadoop that will know that a particular invocation of map() is emitting an inordinate number of key-value pairs. I'm guessing that in your map() function there is some kind of loop that emits these key-value pairs. You can simply code the loop to short-circuit once it has emitted more than N pairs.
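A rough sketch of that short-circuit idea in plain Java, in case it helps. The class name CappedEmitter and the cap value are made up for illustration and are not part of Hadoop. The sink stands in for whatever your map() uses to emit a pair, and I'm assuming String keys and values only to keep the example small.

```java
import java.util.function.BiConsumer;

/**
 * Hypothetical helper (not a Hadoop class): forwards key-value pairs to a
 * sink until a per-invocation cap is reached, then refuses further pairs.
 */
public class CappedEmitter {
    private final int maxPairs;                     // N, the cap per map() call
    private final BiConsumer<String, String> sink;  // stand-in for the real emit call
    private int emitted = 0;

    public CappedEmitter(int maxPairs, BiConsumer<String, String> sink) {
        this.maxPairs = maxPairs;
        this.sink = sink;
    }

    /** Emits the pair and returns true, or returns false once the cap is hit. */
    public boolean emit(String key, String value) {
        if (emitted >= maxPairs) {
            return false; // caller should break out of its loop
        }
        sink.accept(key, value);
        emitted++;
        return true;
    }

    /** Number of pairs actually forwarded to the sink so far. */
    public int emittedCount() {
        return emitted;
    }
}
```

Inside map() you would create one of these per invocation, have the sink call your collector, and break out of the emitting loop as soon as emit() returns false. Note that this silently drops data past the cap, so it only makes sense if losing those pairs is acceptable for your job.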

Another option is to figure out some way to partition the input values so that the mappers are dealing with more granular chunks, so that all the mappers are doing roughly the same amount of work.

I'm not sure exactly what you're trying to do, so these suggestions might not apply. Hope this helps.

bajafresh4life