views: 218

answers: 3

I have a rather simple Hadoop question, which I'll try to present with an example.

Say you have a list of strings and a large file, and you want each mapper to process a piece of the file and one of the strings in a grep-like program.

How are you supposed to do that? I am under the impression that the number of mappers is a result of the InputSplits produced. I could run subsequent jobs, one for each string, but that seems kinda... messy?

Edit: I am not actually trying to build a MapReduce version of grep. I used it as an example of having 2 different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on 1 element from list A and 1 element from list B.

So given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with all mappers and then feed 1 element of list B to each mapper?

What I am trying to do is build some type of prefix look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after 1 chunk of text / 1 string per mapper.

+1  A: 

Mappers should be able to work independently and without side effects. The parallelism can be that a mapper tries to match a line against all patterns. Each input line is only processed once!

Otherwise you could multiply each input line by the number of patterns, process each copy of the line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?

In my opinion you should prefer the first scenario: Each mapper processes a line independently and checks it against all known patterns.

Hint: You can distribute the patterns to all mappers with the DistributedCache feature! ;-) The input should be split line by line (e.g. with the standard TextInputFormat).
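Just to sketch the first scenario (untested, the paths and class names are placeholders): the mapper loads the patterns from the DistributedCache once in setup() and then checks every line against all of them.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side, before submitting the job, something like:
    //   DistributedCache.addCacheFile(new URI("/user/you/patterns.txt"), job.getConfiguration());

    public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<String> patterns = new ArrayList<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached pattern file is available on the local disk of every task node.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                patterns.add(line);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is processed exactly once and checked against all patterns.
            for (String pattern : patterns) {
                if (line.toString().contains(pattern)) {
                    context.write(new Text(pattern), line);
                }
            }
        }
    }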

Peter Wippermann
Thanks for your answer, I have edited my question trying to clarify what I'm actually after.
aeolist
Ah yes, my bad, you do mention ChainMapper and all that, it just wasn't clear to me how to model the problem after that at the time. The "multiply each input line by the number of patterns, process each line with a single pattern" part is still a little vague to me. Anyway, I'll try that approach.
aeolist
A: 

Regarding your edit: In general a mapper is not used to process 2 elements at once; it should only process one element at a time. The job should be designed in such a way that there could be a mapper for each input record and it would still run correctly!

Of course it is fine if the mapper needs some supporting information to process the input. This information can be passed in via the job configuration (Configuration.set(), for example). A larger set of data should be passed via the DistributedCache.
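Just as a sketch (the key name is made up): after the driver has put a small value into the configuration with conf.set("grep.single.pattern", "..."), every mapper can read it back in setup().

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SinglePatternMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String pattern;

        @Override
        protected void setup(Context context) {
            // The driver set this with Configuration.set("grep.single.pattern", "...")
            // before submitting the job; every mapper reads it back here.
            pattern = context.getConfiguration().get("grep.single.pattern");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains(pattern)) {
                context.write(new Text(pattern), line);
            }
        }
    }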

Did you have a look at one of these options? I'm not sure if I fully understood your problem, so please check for yourself whether that would work ;-)

BTW: An appreciative vote for my well-investigated previous answer would be nice ;-)

Peter Wippermann
You are right, a vote is in order. As I said, I need n*m mappers; is there any way to do that other than my answer above? Again, having every mapper process all strings for its respective split is a bad scenario - my processing is just too memory-heavy.
aeolist
A: 

A good friend had a great epiphany: what about chaining 2 mappers?

In the main method, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets one string only.

In turn, the first mapper starts a new job whose input is the text. It can communicate the string by setting a variable in the context (i.e. the new job's configuration).
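Roughly what I have in mind (an untested sketch; the class names, key name and paths are made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // First-level mapper: receives one string per map() call and fires up a second
    // job over the big text, handing the string over via that job's configuration.
    public class LauncherMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text oneString, Context context)
                throws IOException, InterruptedException {
            Configuration conf = new Configuration();
            conf.set("lookup.string", oneString.toString());

            Job inner = new Job(conf, "process-" + oneString.toString());
            inner.setJarByClass(LauncherMapper.class);
            inner.setMapperClass(TextChunkMapper.class);
            inner.setNumReduceTasks(0);
            inner.setOutputKeyClass(Text.class);
            inner.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(inner, new Path("/data/big_text"));
            FileOutputFormat.setOutputPath(inner, new Path("/out/" + offset.get()));

            try {
                inner.waitForCompletion(true);   // or inner.submit() to avoid blocking
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        // Second-level mapper: reads the string back and works on one split of the text.
        public static class TextChunkMapper extends Mapper<LongWritable, Text, Text, Text> {

            private String s;

            @Override
            protected void setup(Context context) {
                s = context.getConfiguration().get("lookup.string");
            }

            @Override
            protected void map(LongWritable offset, Text chunk, Context context)
                    throws IOException, InterruptedException {
                // ... build the look-up structure for (chunk, s) here ...
            }
        }
    }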

aeolist
Well, ChainMappers and passing variables via the context don't seem new to me, regarding my two posts from before... Anyway: each mapper starting a new job sounds horrible to me! If you have n strings, you will have n jobs?! *urgs* :-( Every mapper should process all strings at once for a record. Listen to me ;-)
Peter Wippermann
I think I really need this. My mapper is going to construct a data structure 10 times as big as the split it gets; I don't want to go to cache etc. Another proposed idea is to do this n*m on the data itself: combining everything a priori in the input folder. Imagine me having 300 strings and a 10GB input sequence. Yes, that idea involves 300*10GB of input... ugly.
aeolist
So the question is: a small number of jobs dealing with lots of data, or a larger number of jobs dealing with less data? I'll go with the second. BTW, ChainMapper is not actually helpful, since I'm thinking something along the lines of: start a job, no reducer, point it at input_strings; this mapper has a for loop that starts a new job, sets the string, etc.
aeolist
Ok, I understand your scenario a bit better. Let me state that it is ok, and even intended by the framework, for the mappers (and reducers) to be quite busy with the work they have. Basically Hadoop was built for large throughput and to process large data. But map()/reduce() don't have to be just a few lines of code! It's ok if they take a bit longer. That's why the progress-reporting methods have been designed: to tell the master node that the slave is still alive but busy.
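E.g. (just a sketch with the new mapreduce API, the loop stands for your heavy per-record work):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BusyMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text chunk, Context context)
                throws IOException, InterruptedException {
            for (int i = 0; i < 1000; i++) {
                // ... some heavy computation on the chunk here ...
                context.progress();                          // keep-alive signal to the master
                context.setStatus("still working on chunk at " + offset.get());
            }
        }
    }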
Peter Wippermann
Every job means overhead, so you should avoid having many jobs. Instead it's ok if a few jobs run for hours. BTW: Starting several new jobs within a job - you know that jobs are executed sequentially and not in parallel?
Peter Wippermann
That's really bad, I didn't know about the parallel thing. There is also another detail making things difficult for me: my input data is a big sequence and I am using FixedLengthInputFormat to split it up. It takes care of everything for me, splitting up my sequence into string_length pieces. If I do 1 job per sequence split, I'm going to need to change that too... need to think about it.
aeolist
"you know, that jobs are executed sequentially and not in parallel?"how is this true, can't i use job.submit() instead of job.waitForCompletion()??
aeolist