ansaurus

Question

How to combine multiple Hadoop MapReduce Jobs into one?

Answer 1

A:

I don't understand why you're not just writing different jobs with different mappers, feeding in the same input dataset. Could you explain why the default practice doesn't fit your workflow?

marcorossi 2010-06-30 15:26:16

I need to be as fast as possible. I suspect that sequential jobs are slower because of their overhead.

stefanw 2010-07-01 00:36:31

Answer 2

+2 A:

Is this a sensible approach?

There's nothing inherently wrong with it, other than that it couples the maintenance of the different jobs' logic. I believe it will save you some disk I/O, which could be a win if disk is a bottleneck for your process (on small clusters this can be the case).

Are there better alternatives?

It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
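To make the delegation idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. Every name in it (MapLogic, WordCountLogic, LineLengthLogic, DelegatingMapper) is hypothetical, and the configuration is modelled as a simple map from a job tag to a class. In a real job the generic mapper would presumably extend Hadoop's Mapper and read the class references from the job Configuration; this only illustrates the shape of the approach.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for one job's map logic; not a Hadoop interface.
interface MapLogic {
    void map(String line, BiConsumer<String, String> out);
}

// Example logic for a first job: emit (word, "1") for every word in the line.
class WordCountLogic implements MapLogic {
    public void map(String line, BiConsumer<String, String> out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word, "1");
            }
        }
    }
}

// Example logic for a second job: emit ("length", <number of characters>).
class LineLengthLogic implements MapLogic {
    public void map(String line, BiConsumer<String, String> out) {
        out.accept("length", Integer.toString(line.length()));
    }
}

// Generic mapper: it knows nothing about the individual jobs. It instantiates
// the configured classes and prefixes every emitted key with the job's tag,
// so the reduce side can tell the record streams apart.
class DelegatingMapper {
    private final Map<String, MapLogic> delegates = new LinkedHashMap<>();

    // conf maps a job tag to the class implementing that job's map logic
    // (an assumption standing in for real job configuration).
    DelegatingMapper(Map<String, Class<? extends MapLogic>> conf) {
        for (Map.Entry<String, Class<? extends MapLogic>> e : conf.entrySet()) {
            try {
                delegates.put(e.getKey(),
                        e.getValue().getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException(
                        "Cannot instantiate delegate for " + e.getKey(), ex);
            }
        }
    }

    // Reads the input line once and feeds it to every delegate.
    // Each result is a {taggedKey, value} pair.
    List<String[]> map(String line) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, MapLogic> e : delegates.entrySet()) {
            String tag = e.getKey();
            e.getValue().map(line,
                    (k, v) -> out.add(new String[] {tag + "\t" + k, v}));
        }
        return out;
    }
}
```

The point of the design is that each input record is read only once, while each job's logic lives in its own class and can be maintained separately. A matching generic reducer would strip the tag from the key and hand the values to the corresponding reduce logic.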

Does it have some terrible drawback?

The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data (speculative execution); this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.

Would I need a custom Partitioner class for this approach?

Probably, depending on what you're doing. I think in general if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be a library Partitioner that can be configured for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and use String field separators instead of rolling your own).
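As a rough illustration of partitioning on a key field, the sketch below picks a partition from the first tab-separated field of a text key only. It is plain Java and not Hadoop's KeyFieldBasedPartitioner; the class name FirstFieldPartitioner and the tab separator are assumptions made for the example.

```java
// Hypothetical partitioner sketch: all keys that share the same first
// tab-separated field are assigned to the same partition.
class FirstFieldPartitioner {
    static int getPartition(String key, int numPartitions) {
        int sep = key.indexOf('\t');
        // Use the whole key when it contains no separator.
        String field = sep < 0 ? key : key.substring(0, sep);
        // Mask off the sign bit so the result is never negative,
        // then map the hash into the range [0, numPartitions).
        return (field.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With keys of the form "tag<TAB>realKey", this would send everything belonging to one tag to one partition. Whether that is the right grouping depends on what the reducers need to see together, so treat the choice of field as a placeholder.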

HTH. If you can give a little more context, maybe I could offer more advice. Good luck!

cgs1019 2010-06-30 20:33:50

Thanks so far! I've added some more context and would love to hear your thoughts on that.

stefanw 2010-07-01 00:38:12

Answer 3

A:

You can use:

Cascading
Oozie

Both are used to write workflows in Hadoop.

balaji 2010-08-24 07:43:06
