I'm working with my team on a small application that takes a lot of input (a day's worth of log files) and produces useful output after several map-reduce steps (currently 4, in the future perhaps 10), using Hadoop & Java.

Now I've done a partial POC of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you do the partitioning "wrong", the horizontal-scaling characteristics are wrecked beyond recognition. Comparing a test run on a single node (say 20 minutes) with one on all 4 nodes, I only got a 50% speedup (about 10 minutes), where I expected at least a 70-75% speedup (about 5 or 6 minutes).

The general principle of making map-reduce scale horizontally is to keep the partitions as independent as possible. I found that in my case I did the partitioning of each step "wrong" because I simply used the default HashPartitioner; this makes records jump to a different partition in the next map-reduce step.

I expect (though I haven't tried it yet) that I can speed things up and make them scale much better if I can convince as many records as possible to stay in the same partition across steps (i.e. by building a custom partitioner), roughly along the lines of the sketch below.
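A minimal sketch of what I have in mind (the key layout and class name are made up for illustration): partition only on the stable part of the key, assuming the map output key is a Text whose prefix before a '|' identifies the records that belong together.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch of a custom partitioner: partition only on the stable part of the
    // key so that records which belong together land in the same partition in
    // every map-reduce step. Assumes the map output key is a Text of the form
    // "<stable-field>|<rest>"; the layout is hypothetical.
    public class StablePrefixPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String stablePart = key.toString().split("\\|", 2)[0];
            return (stablePart.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered with job.setPartitionerClass(StablePrefixPartitioner.class) in each step.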

In the case described above I found this solution by hand: I deduced what went wrong by thinking hard about it during my drive to work.

Now my questions to you all:

- What tools are available to detect issues like this?
- Are there any guidelines/checklists to follow?
- How do I go about measuring things like "the number of records that jumped partition"? (One rough idea I have is sketched below.)
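For that last point, the only untested idea I have (all names below are hypothetical, and it assumes the next step reads the previous step's part-r-NNNNN sequence files with Text keys): in the mapper, derive the previous partition from the input file name, recompute where the default hash would send the key in this step, and bump a counter when they differ.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hypothetical sketch: count how many records would land in a different
    // partition than the one that produced them in the previous step.
    public class PartitionJumpMapper extends Mapper<Text, IntWritable, Text, IntWritable> {

        private int previousPartition;

        @Override
        protected void setup(Context context) {
            // Previous step's partition number from the file name, e.g. "part-r-00003" -> 3.
            String name = ((FileSplit) context.getInputSplit()).getPath().getName();
            previousPartition = Integer.parseInt(name.substring(name.lastIndexOf('-') + 1));
        }

        @Override
        protected void map(Text key, IntWritable value, Context context)
                throws IOException, InterruptedException {
            // Same hash the default HashPartitioner uses, for this step's reducer count.
            int nextPartition = (key.hashCode() & Integer.MAX_VALUE) % context.getNumReduceTasks();
            if (nextPartition != previousPartition) {
                context.getCounter("PartitionStability", "RECORDS_JUMPED").increment(1);
            } else {
                context.getCounter("PartitionStability", "RECORDS_STAYED").increment(1);
            }
            context.write(key, value);
        }
    }

The two counters would then show up in the job's counter output, but I don't know whether this is how people normally measure this.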

Any suggestions (tools, tutorials, books, ...) are greatly appreciated.

A: 

Make sure you're not running into the small-files problem. Hadoop is optimized for throughput rather than latency, so it will process many log files joined into one large sequence file much more quickly than many individual files stored on HDFS. Using sequence files in this way eliminates the extra housekeeping needed for individual map and reduce tasks and improves data locality. But yes, it's important that your map outputs are reasonably well distributed across the reducers, so that a few reducers are not overloaded with a disproportionate amount of work.
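For example, joining a day's worth of small log files into one sequence file could look roughly like this (the paths and the file-name-as-key layout are just illustrative, not a prescription):

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Rough sketch: pack a local directory of small daily log files into a single
    // SequenceFile on HDFS, using the file name as key and its contents as value.
    public class LogPacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path output = new Path("/logs/2010-06-01.seq");                // hypothetical HDFS target

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
            try {
                for (File log : new File("/var/log/myapp").listFiles()) {  // hypothetical local source
                    String contents = new String(Files.readAllBytes(log.toPath()), "UTF-8");
                    writer.append(new Text(log.getName()), new Text(contents));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

The map-reduce jobs can then read that one file with SequenceFileInputFormat instead of thousands of tiny inputs.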

Jakob Homan
A: 

Take a look at the Karmasphere (formerly known as Hadoop Studio) plugin for NetBeans/Eclipse: http://karmasphere.com/Download/download.html. There's a free version that can help with detecting and test-running Hadoop jobs.
I have tested it a little and it looks promising.

Wojtek