I'm working with my team on a small application that takes a lot of input (a day's worth of logfiles) and produces useful output after several map-reduce steps (currently 4, in the future perhaps 10), built on Hadoop and Java.
Now I've done a partial proof of concept of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you do the partitioning "wrong", the horizontal-scaling characteristics are wrecked beyond recognition. Comparing a test run on a single node (about 20 minutes) with one on all 4 nodes gave only a 50% speedup (about 10 minutes), where I expected a 75% (or at least >70%) speedup (about 5 or 6 minutes).
The general principle for making map-reduce scale horizontally is to keep the partitions as independent as possible. I found that in my case I did the partitioning of each step "wrong" because I simply used the default HashPartitioner; this makes records jump to a different partition in the next map-reduce step.
I expect (though I haven't tried it yet) that I can speed things up and make them scale much better if I can convince as many records as possible to stay in the same partition across steps, i.e. by building a custom partitioner.
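I haven't written that partitioner yet, but what I have in mind is roughly the sketch below. The key layout (a stable id followed by a tab and a step-specific suffix) and the class name are assumptions for illustration, not how my real keys actually look:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: partition only on the stable part of the key, so the same logical
// group of records lands in the same partition in every map-reduce step.
public class StableFieldPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Assumed key layout: "<stableId>\t<stepSpecificSuffix>".
        String composite = key.toString();
        int tab = composite.indexOf('\t');
        String stablePart = (tab >= 0) ? composite.substring(0, tab) : composite;

        // Same hashing scheme as the default HashPartitioner,
        // but applied to the stable part only.
        return (stablePart.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The idea would be to register this on every job with job.setPartitionerClass(StableFieldPartitioner.class), so all steps agree on where a given stable id lives, but I'd still like to verify the effect rather than guess.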
In the case described above I found the solution by hand: I deduced what went wrong by thinking hard about it during my drive to work.
Now my questions to you all:

- What tools are available to detect issues like this?
- Are there any guidelines/checklists to follow?
- How do I go about measuring things like "the number of records that jumped partition"? (A rough sketch of the only approach I've come up with so far is below.)
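For that last question, the best I've thought of is to abuse Hadoop counters: have each step tag its output values with the partition number it ran in, and let the next step count how many records arrived from a different partition. A rough sketch, where the value layout ("<previousPartition>\t<payload>") and class names are made up for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: count how many records changed partition between two steps,
// assuming the previous step prefixed each value with its partition number.
public class PartitionJumpReducer extends Reducer<Text, Text, Text, LongWritable> {

    public enum JumpCounter { JUMPED_PARTITION, STAYED_IN_PARTITION }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // For a reducer, the task id equals the partition number it handles.
        int currentPartition = context.getTaskAttemptID().getTaskID().getId();
        long total = 0;

        for (Text value : values) {
            // Assumed value layout: "<previousPartition>\t<payload>".
            int previousPartition =
                    Integer.parseInt(value.toString().split("\t", 2)[0]);
            if (previousPartition != currentPartition) {
                context.getCounter(JumpCounter.JUMPED_PARTITION).increment(1);
            } else {
                context.getCounter(JumpCounter.STAYED_IN_PARTITION).increment(1);
            }
            total++;
        }
        context.write(key, new LongWritable(total));
    }
}
```

This feels rather hand-rolled, which is exactly why I'm asking whether there are existing tools or better-established ways to measure this.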
Any suggestions (tools, tutorials, books, ...) are greatly appreciated.