views: 218

answers: 3

I have a rather simple Hadoop question, which I'll try to present with an example.

Say you have a list of strings and a large file, and you want each mapper to process a piece of the file and one of the strings in a grep-like program.

How are you supposed to do that? I am under the impression that the number of mappers is a result of the InputSplits produced. I could run subsequent jobs, one for each string, but that seems kinda... messy?

Edit: I am not actually trying to build a MapReduce version of grep. I used it as an example of having 2 different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on 1 element from list A and 1 element from list B.

So given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with all mappers and then feed 1 element of list B to each mapper?

What I am trying to do is build some type of prefix look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after 1 chunk of text / 1 string per mapper.

+1  A: 

Mappers should be able to work independently and without side effects. The parallelism can be that a mapper tries to match a line against all patterns. Each input line is only processed once!

Otherwise you could multiply each input line by the number of patterns, process each copy of the line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?

In my opinion you should prefer the first scenario: Each mapper processes a line independently and checks it against all known patterns.

Hint: You can distribute the patterns to all mappers with the DistributedCache feature! ;-) The input should be split line by line (e.g. with the standard TextInputFormat).
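Just to sketch the first scenario (untested, the paths and class names are placeholders): the mapper loads the patterns from the DistributedCache once in setup() and then checks every line against all of them.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side, before submitting the job, something like:
    //   DistributedCache.addCacheFile(new URI("/user/you/patterns.txt"), job.getConfiguration());

    public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<String> patterns = new ArrayList<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached pattern file is available on the local disk of every task node.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                patterns.add(line);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is processed exactly once and checked against all patterns.
            for (String pattern : patterns) {
                if (line.toString().contains(pattern)) {
                    context.write(new Text(pattern), line);
                }
            }
        }
    }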

Peter Wippermann
Thanks for your answer, I have edited my question trying to clarify what I'm actually after.
aeolist
Ah yes, my bad, you do mention ChainMapper and all that, it just wasn't clear to me how to model the problem after that at the time. The "multiply each input line by the number of patterns, process each line with a single pattern" part is still a little vague to me. Anyway, I'll try that approach.
aeolist
A: 

Regarding your edit: In general a mapper is not used to process 2 elements at once; it should only process one element at a time. The job should be designed in such a way that there could be a mapper for each input record and it would still run correctly!

Of course it is fine if the mapper needs some supporting information to process the input. This information can be passed in via the job configuration (Configuration.set(), for example). A larger set of data should be passed via the DistributedCache.
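Just as a sketch (the key name is made up): after the driver has put a small value into the configuration with conf.set("grep.single.pattern", "..."), every mapper can read it back in setup().

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SinglePatternMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String pattern;

        @Override
        protected void setup(Context context) {
            // The driver set this with Configuration.set("grep.single.pattern", "...")
            // before submitting the job; every mapper reads it back here.
            pattern = context.getConfiguration().get("grep.single.pattern");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains(pattern)) {
                context.write(new Text(pattern), line);
            }
        }
    }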

Did you have a look at one of these options? I'm not sure if I fully understood your problem, so please check for yourself whether that would work ;-)

BTW: An appreciative vote for my well-investigated previous answer would be nice ;-)

Peter Wippermann
You are right, a vote is in order. As I said, I need n*m mappers; is there any way to do that other than my answer above? Again, having every mapper process all strings for its respective split is a bad scenario - my processing is just too memory-heavy.
aeolist
A: 

A good friend had a great epiphany: what about chaining 2 mappers?

In the main method, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets one string only.

In turn, the first mapper starts a new job whose input is the text. It can communicate the string by setting a variable in the context (i.e. the new job's configuration).
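Roughly what I have in mind (an untested sketch; the class names, key name and paths are made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // First-level mapper: receives one string per map() call and fires up a second
    // job over the big text, handing the string over via that job's configuration.
    public class LauncherMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text oneString, Context context)
                throws IOException, InterruptedException {
            Configuration conf = new Configuration();
            conf.set("lookup.string", oneString.toString());

            Job inner = new Job(conf, "process-" + oneString.toString());
            inner.setJarByClass(LauncherMapper.class);
            inner.setMapperClass(TextChunkMapper.class);
            inner.setNumReduceTasks(0);
            inner.setOutputKeyClass(Text.class);
            inner.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(inner, new Path("/data/big_text"));
            FileOutputFormat.setOutputPath(inner, new Path("/out/" + offset.get()));

            try {
                inner.waitForCompletion(true);   // or inner.submit() to avoid blocking
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        // Second-level mapper: reads the string back and works on one split of the text.
        public static class TextChunkMapper extends Mapper<LongWritable, Text, Text, Text> {

            private String s;

            @Override
            protected void setup(Context context) {
                s = context.getConfiguration().get("lookup.string");
            }

            @Override
            protected void map(LongWritable offset, Text chunk, Context context)
                    throws IOException, InterruptedException {
                // ... build the look-up structure for (chunk, s) here ...
            }
        }
    }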

aeolist
Well, ChainMappers and passing variables via the context don't seem new to me, regarding my two posts from before... Anyway: each mapper starting a new job sounds horrible to me! If you have n strings, you will have n jobs?! *urgs* :-( Every mapper should process all strings at once for a record. Listen to me ;-)
Peter Wippermann
I think I really need this. My mapper is going to construct a data structure 10 times as big as the split it gets; I don't want to go to cache etc. Another proposed idea is to do this n*m on the data itself: combining everything a priori in the input folder. Imagine me having 300 strings and a 10GB input sequence. Yes, that idea involves 300*10GB of input... ugly.
aeolist
So the question is: a small number of jobs dealing with lots of data, or a larger number of jobs dealing with less data? I'll go with the second. BTW, ChainMapper is not actually helpful, since I'm thinking something along the lines of: start a job, no reducer, point it at input_strings; this mapper has a for loop that starts a new job, sets the string, etc.
aeolist
Ok, I understand your scenario a bit better. Let me state that it is ok, and even intended by the framework, for the mappers (and reducers) to be quite busy with the work they have. Basically Hadoop was built for large throughput and to process large data. But map()/reduce() don't have to be just a few lines of code! It's ok if they take a bit longer. That's why the progress-reporting methods have been designed: to tell the master node that the slave is still alive but busy.
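E.g. (just a sketch with the new mapreduce API, the loop stands for your heavy per-record work):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BusyMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text chunk, Context context)
                throws IOException, InterruptedException {
            for (int i = 0; i < 1000; i++) {
                // ... some heavy computation on the chunk here ...
                context.progress();                          // keep-alive signal to the master
                context.setStatus("still working on chunk at " + offset.get());
            }
        }
    }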
Peter Wippermann
Every job means overhead, so you should avoid having many jobs. Instead it's ok if a few jobs run for hours. BTW: Starting several new jobs within a job - you know that jobs are executed sequentially and not in parallel?
Peter Wippermann
That's really bad, I didn't know about the parallel thing. There is also another detail making things difficult for me: my input data is a big sequence and I am using FixedLengthInputFormat to split it up. It takes care of everything for me, splitting up my sequence into string_length pieces. If I do 1 job per sequence split, I'm going to need to change that too... need to think about it.
aeolist
"you know, that jobs are executed sequentially and not in parallel?"how is this true, can't i use job.submit() instead of job.waitForCompletion()??
aeolist