I want to do log parsing of huge amounts of data and gather analytic information. However, all the data comes from external sources, and I only have 2 machines to store it on, with one acting as backup/replication.

I'm trying to use Hadoop, Lucene, etc. to accomplish that. But all the training docs mention that Hadoop is useful for distributed, multi-node processing. My setup does not fit that architecture.

Are there any overheads to using Hadoop with just 2 machines? If Hadoop is not a good choice, are there alternatives? We looked at Splunk and we like it, but it is too expensive for us to buy. We just want to build our own.

A: 

Hadoop should be used for distributed batch processing problems.

See: 5 Common Questions About Hadoop

Analysis of log files is one of the more common uses of Hadoop; it's one of the tasks Facebook uses it for.

If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes, the time it takes to process the same amount of data is reduced.

You say you have huge amounts of data? That's an important number to pin down. Personally, when I think huge in terms of data, I think in the hundreds-of-terabytes-plus range. If that's the case, you'll probably need more than two machines, especially if you want to use replication over HDFS.
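One practical note if you do stay at two machines: HDFS defaults to a replication factor of 3, which a 2-node cluster cannot satisfy, so you'd typically drop it to 2. A minimal sketch of the relevant setting in hdfs-site.xml (the property name is standard Hadoop; merge it into whatever config your install already has):

    <!-- hdfs-site.xml: keep 2 copies of each block, one per machine.
         The default is 3, which a 2-node cluster cannot satisfy. -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>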

What analytic information do you want to gather? Have you determined that those questions can be answered using the MapReduce approach?
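For example, a question like "how many times did each user log in?" maps naturally onto MapReduce: the mapper emits (user, 1) for each login line, and the reducer sums the counts. Here's a minimal sketch using the standard Hadoop MapReduce API; the log format and field positions are hypothetical placeholders you'd adjust to your actual logs:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LoginCount {

        // Mapper: emit (username, 1) for every line that records a login.
        public static class LoginMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text user = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical format: "2009-06-01 12:00:00 LOGIN alice ..."
                String[] fields = value.toString().split("\\s+");
                if (fields.length >= 4 && "LOGIN".equals(fields[2])) {
                    user.set(fields[3]);
                    context.write(user, ONE);
                }
            }
        }

        // Reducer: sum the per-user counts emitted by the mappers.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "login count");
            job.setJarByClass(LoginCount.class);
            job.setMapperClass(LoginMapper.class);
            // Reducer doubles as combiner since its input and output types match.
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You'd run it with something like (jar name and paths are placeholders): hadoop jar loganalysis.jar LoginCount /logs/input /logs/output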

Something you could consider would be to use Hadoop on Amazon's EC2 if you have a limited amount of hardware resources. Here are some links to get you started:

Binary Nerd
Thanks. For the next couple of years, we might not have more than 5 terabytes. I have some learning to do... Our idea is to use MapReduce to answer analytical questions like user login counts, server failure rates, etc.: general information gathered from logs. I read about the Rackspace implementation of distributed log parsing using Hadoop, so I'm trying to test this out.
chandra
So if you have 5 terabytes and use a replication factor of 2, that's 5 TB × 2 replicas = 10 TB of raw storage across the two machines, meaning you should ensure you have 5 TB on each machine for the data, plus a few more TB for the output from your MapReduce jobs. I'd check out the book Hadoop: The Definitive Guide by Tom White. It's a good resource.
Binary Nerd