Hadoop: Disadvantages of using just 2 machines?

Hadoop should be used for distributed batch processing problems.

Analysis of log files is one of the more common uses of Hadoop; it's one of the tasks Facebook uses it for.

If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes the time it takes to process the same amount of data is reduced.

You say you have huge amounts of data? The actual numbers are important to understand. Personally, when I think huge in terms of data, I think in the range of hundreds of terabytes and up. If this is the case, you'll probably need more than two machines, especially if you want to use replication in HDFS.

As for the analytic information you want to gather: have you determined that these questions can be answered using the MapReduce approach?
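One rough way to check is to try phrasing a question as a map step that emits key/value pairs and a reduce step that combines the values for each key. The sketch below does this in plain Python, without Hadoop, for counting events in a log. The log format (timestamp, source, event type separated by whitespace) and the function names are assumptions made up for illustration, not anything taken from your logs.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (event_type, 1) for each well-formed line.

    Assumes a hypothetical format: "<timestamp> <source> <event_type>".
    Lines with fewer than three fields are skipped.
    """
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


sample = [
    "2010-02-11T18:00:01 alice LOGIN",
    "2010-02-11T18:00:05 bob LOGIN",
    "2010-02-11T18:01:10 web03 SERVER_FAILURE",
    "malformed",
]
print(run_job(sample))
```

If a question fits this shape (independent per-line work, then aggregation by key), it is probably a reasonable candidate for MapReduce.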

If you have a limited amount of hardware resources, something you could consider would be to use Hadoop on Amazon's EC2. Here are some links to get you started:

Thanks. For the next couple of years, we might not have more than 5 terabytes. I have some learning to do... Our idea is to use MapReduce to answer analytical questions about things like user login data, server failure rates, etc.: general information gathered from logs. I read about the Rackspace implementation of distributed log parsing using Hadoop, so I'm trying to test this out.

chandra 2010-02-11 18:53:26

So if you have 5 terabytes and you use a replication factor of 2, you should ensure you have 5TB on each machine for the data, plus a few more TBs for the output from your MapReduce jobs. I'd check out the book Hadoop: The Definitive Guide, by Tom White. It's a good resource.

Binary Nerd 2010-02-11 19:07:21
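The sizing above is simple arithmetic, and a small sketch makes it easy to redo with other numbers. It assumes blocks end up spread evenly across the machines, and it treats the job output as a single total that is already counted at its replicated size; the function name and parameters are hypothetical.

```python
def raw_storage_per_node_tb(data_tb, replication, nodes, output_tb=0.0):
    """Estimate the disk needed on each node, in terabytes.

    Assumes data is distributed evenly across nodes. output_tb is the
    total space set aside for job output, already including replication.
    """
    if nodes < 1 or replication < 1:
        raise ValueError("nodes and replication must both be at least 1")
    return (data_tb * replication + output_tb) / nodes


# 5 TB of data, replication factor 2, two machines
print(raw_storage_per_node_tb(5, 2, 2))
# Same cluster with 2 TB reserved for job output
print(raw_storage_per_node_tb(5, 2, 2, output_tb=2))
```

This only covers HDFS data. Any real plan would also leave headroom for the operating system and for the temporary files MapReduce jobs write while running.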
