I have a massive amount of input data (which is why I use Hadoop), and there are multiple tasks, each solved by a series of MapReduce steps whose first mapper needs all the data as input.

My goal: Compute these different tasks as fast as possible.

I currently run them sequentially, each reading in all the data. I assume it would be faster to combine the tasks and execute their shared parts (like feeding all the data to the mapper) only once.

I was wondering if and how I can combine these tasks. For every input key/value pair, the mapper could emit a "super key" that contains a task ID and the task-specific key data, along with the value. Reducers would then receive values grouped by task and task-specific key, and could decide from the "super key" which task to perform on the included key and values.

In pseudo code:

map(key, value):
    emit(SuperKey("Task 1", IncludedKey), value)
    emit(SuperKey("Task 2", AnotherIncludedKey), value)

reduce(key, values):
    if key.taskid == "Task 1":
        for value in values:
            // do stuff with key.includedkey and value
    else:
        // do something else

The key could be a WritableComparable which can include all the necessary information.

Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
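
To make the idea a bit more concrete, here is a minimal sketch of what such a composite key might look like in Java (the class and field names are just illustrative, not a finished design):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Illustrative composite key: a task identifier plus the task-specific key.
// Comparison orders first by task, then by the included key, so a reducer
// sees all values for one (task, key) pair together.
public class SuperKey implements WritableComparable<SuperKey> {
    private Text taskId = new Text();
    private Text includedKey = new Text();

    public SuperKey() {}

    public SuperKey(String taskId, String includedKey) {
        this.taskId.set(taskId);
        this.includedKey.set(includedKey);
    }

    public Text getTaskId() { return taskId; }
    public Text getIncludedKey() { return includedKey; }

    @Override
    public void write(DataOutput out) throws IOException {
        taskId.write(out);
        includedKey.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        taskId.readFields(in);
        includedKey.readFields(in);
    }

    @Override
    public int compareTo(SuperKey other) {
        int cmp = taskId.compareTo(other.taskId);
        return cmp != 0 ? cmp : includedKey.compareTo(other.includedKey);
    }

    @Override
    public int hashCode() {
        return taskId.hashCode() * 163 + includedKey.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SuperKey)) return false;
        SuperKey other = (SuperKey) o;
        return taskId.equals(other.taskId) && includedKey.equals(other.includedKey);
    }
}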

My questions are:

  • Is this a sensible approach?
  • Are there better alternatives?
  • Does it have some terrible drawback?
  • Would I need a custom Partitioner class for this approach?

Context: The data consists of several million RDF quadruples, and the tasks are to calculate clusters, statistics, and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.

The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.

A: 

I don't understand why you're not just writing different jobs with different mappers, feeding in the same input dataset. Could you explain in more detail why the default practice doesn't fit your workflow?

marcorossi
I need this to be as fast as possible. I suspect that sequential jobs are slower because of their overhead.
stefanw
+2  A: 
  • Is this a sensible approach?

There's nothing inherently wrong with it, other than that it couples the maintenance of the different tasks' logic. I believe it will save you some disk I/O, which could be a win if disk is a bottleneck for your process (this can be the case on small clusters).

  • Are there better alternatives?

It may be prudent to write a somewhat framework-like Mapper and Reducer that each accept, as configuration parameters, references to the classes they should defer to for the actual mapping and reducing. This would address the coupling mentioned above (maybe you've already thought of this).
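
As a rough illustration of that idea (using the old mapred API; the TaskLogic interface and the "tasks.classes" property name are made up for this sketch, and Text stands in for the composite key from the question):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical per-task interface: each task turns one input record into
// zero or more output pairs.
interface TaskLogic {
    void map(LongWritable key, Text value,
             OutputCollector<Text, Text> output) throws IOException;
}

// Framework-like mapper that defers to the task classes named in the job
// configuration under the (illustrative) property "tasks.classes".
public class DelegatingTaskMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private TaskLogic[] tasks;

    @Override
    public void configure(JobConf conf) {
        Class<?>[] classes = conf.getClasses("tasks.classes");
        tasks = new TaskLogic[classes.length];
        for (int i = 0; i < classes.length; i++) {
            tasks[i] = (TaskLogic) ReflectionUtils.newInstance(classes[i], conf);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Every registered task sees each input record exactly once.
        for (TaskLogic task : tasks) {
            task.map(key, value, output);
        }
    }
}

The driver would register the tasks with something like conf.setStrings("tasks.classes", ClusterTask.class.getName(), StatsTask.class.getName()) (hypothetical class names), so the mapper itself stays unchanged as tasks are added or removed.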

  • Does it have some terrible drawback?

The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
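
What's being described there is Hadoop's speculative execution. If duplicate attempts turn out to be a problem (for example, because a task has side effects), it can be disabled per job; a minimal sketch with the old JobConf API (the helper class name is arbitrary):

import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
    public static JobConf withoutSpeculation(JobConf conf) {
        // Stop Hadoop from launching duplicate attempts of slow-looking
        // map or reduce tasks on other nodes.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        return conf;
    }
}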

  • Would I need a custom Partitioner class for this approach?

Probably, depending on what you're doing. In general, if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be a library Partitioner that can be configured for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and use String field separators instead of rolling your own key).
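
For completeness, a custom Partitioner for a composite key like the SuperKey sketched in the question could be as simple as the following (old mapred API; it assumes the key exposes its two parts via getters):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: hashes task id and included key together so a
// given (task, key) combination always goes to the same reducer, while the
// different tasks' keys stay spread across all reducers.
public class SuperKeyPartitioner implements Partitioner<SuperKey, Text> {

    @Override
    public void configure(JobConf conf) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(SuperKey key, Text value, int numPartitions) {
        int hash = key.getTaskId().hashCode() * 31 + key.getIncludedKey().hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}

It would then be registered on the job with conf.setPartitionerClass(SuperKeyPartitioner.class).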

HTH. If you can give a little more context, maybe I could offer more advice. Good luck!

cgs1019
Thanks so far! I've added some more context and would love to hear your thoughts on that.
stefanw
A: 

You can use:

  1. Cascading
  2. Oozie

Both are used to write workflows in Hadoop.

balaji