views: 258

answers: 3
I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture, but I'm finding it hard to tell whether it would fit our setup. The question isn't strictly programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:

  • We use Oracle for backend
  • Java (Struts2/Servlets/iBatis) for frontend
  • Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours)

We are looking for a way to cut those 5 hours to a shorter time.

Where would Hadoop fit into this picture? Can we continue to use Oracle even after adopting Hadoop?

+2  A: 

Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:

  • Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?

  • Is my job parallelizable? (If your 5-hour batch process consists of thirty 10-minute tasks that have to be run in sequence, Hadoop will not help you.)

  • Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).

As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
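
As a concrete illustration of that "data in, data out" shape, here is a minimal sketch of a Hadoop job driver using the org.apache.hadoop.mapreduce API. The SummaryMapper and SummaryReducer classes are hypothetical placeholders for whatever summarization logic you write; the input and output paths come in as command-line arguments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NightlySummaryJob {
        public static void main(String[] args) throws Exception {
            // Data goes in (args[0]), data comes out (args[1]) - like a Unix pipe.
            Job job = Job.getInstance(new Configuration(), "nightly-summary");
            job.setJarByClass(NightlySummaryJob.class);
            job.setMapperClass(SummaryMapper.class);    // hypothetical mapper
            job.setReducerClass(SummaryReducer.class);  // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You then read the files Hadoop leaves in the output directory and load them wherever you like - Oracle included.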

danben
The second criterion fails in my case. Though they are not 10-minute tasks, each task is dependent on the completion of the previous task. Are there any other solutions, or should we just concentrate on optimizing our queries?
Omnipresent
Well, if the individual tasks are long-running you can parallelize each of them. Say you have five tasks that take one hour each. If you had a cluster of five machines, you could theoretically speed a single task up to 12 minutes (60 / 5). You could also use the same cluster to run each of the tasks, so this would cut your total time down to one hour rather than five. If the tasks are short, you won't see this benefit, as the overhead of setting up the job will outweigh the speedup. So to summarize, this will work if you can break your job into individual pieces that are both long-running and parallelizable.
danben
A: 

The Hadoop Distributed File System (HDFS) supports highly parallel batch processing of data using MapReduce.

So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the types of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean: can you achieve the summaries you need using the key/value pairs MapReduce limits you to?
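
To make the key/value question concrete, here is a minimal sketch of a per-key sum expressed as MapReduce, assuming a hypothetical comma-separated input whose first field is a grouping key (say a customer id) and whose second field is a numeric amount. Each class would live in its own file.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: turn each input line into a (key, amount) pair.
    public class SummaryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical record layout: customerId,amount,...
            String[] fields = line.toString().split(",");
            ctx.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        }
    }

    // Reduce phase: Hadoop groups the pairs by key; we sum each group.
    public class SummaryReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }

If your summaries can be phrased this way - group by something, then aggregate - MapReduce fits naturally; if they need joins across many tables or random lookups, the translation gets harder.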

Hadoop requires a cluster of machines to run. Do you have hardware to support a cluster? This usually comes down to how much data you are storing on HDFS and how fast you want to process it. Generally, when running MapReduce on a Hadoop cluster, the more machines you have, the more data you can store and the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.

You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.
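
As a sketch of that last step, assuming the job wrote its summaries as standard tab-separated text to a hypothetical /summary-output directory and you have a hypothetical daily_summary table, the loader could look like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SummaryLoader {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical reducer output file; TextOutputFormat writes key<TAB>value lines.
            Path results = new Path("/summary-output/part-r-00000");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO daily_summary (cust_id, total) VALUES (?, ?)");
                 BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(results)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t");
                    ps.setString(1, kv[0]);
                    ps.setLong(2, Long.parseLong(kv[1]));
                    ps.addBatch();
                }
                ps.executeBatch();  // one round trip for the whole batch
            }
        }
    }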

Binary Nerd
+2  A: 

The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.

Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology - no matter how fresh and cool it may be - until you have exhausted the capabilities of your current architecture.

If you want some specific advice on how to tune your batch query, well that would be a new question.

APC