We're going to purchase some new hardware to use just for a Hadoop cluster and we're stuck on what we should purchase. Say we have a budget of $5k should we buy two super nice machines at $2500/each, four at around $1200/each or eight at around $600 each? Will hadoop work better with more slower machines or fewest much faster machines? Or, as like most things "it depends"? :-)
You're generally better off with Hadoop getting a few extra machines that are less beefy. You almost never see datanodes with more than 16GB ram and dual quad-core CPUs, and often they are smaller than that.
You always have to run one as the namenode (master), and generally you don't also run a datanode (worker/slave) on the same box, although you could since your cluster is small. Assuming you don't, though, getting 2 machines will leave you only 1 worker node, which somewhat defeats the purpose. (Not entirely, because you can still run 4-8 jobs in parallel on the slave, but still.)
At the same time, you don't want to have a cluster of 1000 486s. If your budget is $5k, I would strike a balance and do 4 $1200 machines. Those will provide a decent baseline in terms of individual performance, you'll have 3 datanodes to distribute work to, and you'll have room to grow your cluster if you need.
Things to keep in mind: you'll want to run multiple map or reduce tasks per datanode, and that means multiple JVMs running simultaneously. I would try to get at least 4GB, and preferably 8GB ram. CPU is less important as most MR jobs are IO bound. You could likely get a machine like this for your $1200 price target, so that's my vote.
I recommend having a look at this presentation: http://www.cloudera.com/hadoop-training-thinking-at-scale Here the various pro's and con's are described.
In a nutshell, you want to max out the number of processor cores and disks. You can sacrifice reliability and quality, but don't get the cheapest hardware out there, as you will have too many reliability problems.
We went with Dell 2xCPU 4-core dell servers, so 8 cores per box. 16GB of memory per box, which is 2GB per core, a bit low as you need memory both for your tasks and for disk buffering. 5x500GB hard drives, and I wish we'd gone for terabyte or higher drives instead.
For drives, my opinion is to buy more cheap, slow, unreliable, high-capacity drives as opposed to more expensive, faster, smaller, reliable drives. If you're having problems with disk throughput, more memory will help with buffering.
This is probably a beefier configuration than you're looking at, but maxing out cores and drives versus buying more boxes is generally a good choice - less power costs, easier to administer, and faster for some operations.
More drives means more simultaneous disk throughput per core, so having as many drives as cores is a good thing. Benchmarking seems to indicate that RAID configurations are slower than JBOD configuration (just mounting the drives and having Hadoop spread load across them) and JBOD is also more reliable.
LAST! Be sure to get ECC memory. Hadoop pushes terabytes of data through memory, and some users have found that non-ECC memory configurations can occasionally introduce single bit errors in terabyte-sized datasets. Debugging these errors is a nightmare.