
I'm building a typical map-reduce architecture (for an O.S. class) and I'm free to decide how the master process will tell its N child processes to parse a log file. So I'm kind of stuck between these two possibilities:

  1. count the number of lines and give X lines to each map, OR


  2. each map starts at the line matching its ID, and the next line to read is current_one + number_of_maps (see the sketch after this list). E.g., with 3 maps, each one is going to read these lines:

    • Map1: 1, 4, 7, 10, 13
    • Map2: 2, 5, 8, 11, 14
    • Map3: 3, 6, 9, 12, 15
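
A minimal sketch of what option 2 could look like with fgets(), assuming each map is told its 1-based id and the total number of maps (the argv-based setup and the names here are only for illustration):

    /* option 2 sketch: every map reads the whole file with fgets() but
     * keeps only every Nth line, where N is the number of maps */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s logfile map_id nmaps\n", argv[0]);
            return EXIT_FAILURE;
        }

        FILE *fp = fopen(argv[1], "r");
        int map_id = atoi(argv[2]);          /* 1..nmaps */
        int nmaps  = atoi(argv[3]);

        if (fp == NULL) { perror(argv[1]); return EXIT_FAILURE; }

        char line[4096];
        long lineno = 0;
        while (fgets(line, sizeof line, fp) != NULL) {
            lineno++;                        /* 1-based, like Map1/Map2/Map3 above */
            if (lineno % nmaps == map_id % nmaps) {
                /* this line belongs to this map: parse it here */
                printf("map %d: line %ld: %s", map_id, lineno, line);
            }
        }
        fclose(fp);
        return EXIT_SUCCESS;
    }

Note that each map still fgets() every line and merely skips the ones it doesn't own, so this scheme does not save any reading.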

I have to do this in order to outperform a single process that parses the entire log file, so the way I split the job between child processes has to be consistent with that objective.

Which one do you think is best? And how can I use scanf or fgets to implement 1) or 2)?

I would be happy with some example code for 2), because the fork/pipes are not my problem :P

RE-EDIT: I'm not encouraged to use select() here, only between the map procs and the reduce process that will be monitoring the reads. I have some restrictions now:

I want each process to read total_lines/N lines. But it seems I have to make the map procs open the file and then read their respective lines. So here are my doubts:

1- Is it bad, or even possible, to make every proc open the file simultaneously or almost simultaneously? Would that help speed things up?

2- If that isn't possible, I will have the parent open the file (instead of each child doing it) and send a struct with min and max limits; then the map procs will read whichever lines they are responsible for, process them and give the reduce process a result (that part doesn't matter for this problem).

How can I correctly divide the number of lines among the N maps and get them all reading at the same time? I think fseek() may be a good weapon, but I don't know HOW to use it. Help, please!

+5  A: 

If I understood correctly, you want to have all processes reading lines from a single file. I don't recommend this; it's kinda messy, and you'll have to either a) read the same parts of the file several times or b) use locking/mutexes or some other mechanism to avoid that. It'll get complicated and hard to debug.

I'd have a master process read the file, and assign lines to a subprocess pool. You can use shared memory to speed this up, and reduce the need for data-copying IPC; or use threads.

As for examples, I answered a question about forking and IPC and gave a code snippet on an example function that forks and returns a pair of pipes for parent-child communication. Let me look that up (...) here it is =P http://stackoverflow.com/questions/3884103/can-popen-make-bidirectional-pipes-like-pipe-fork/3884402#3884402

edit: I kept thinking about this =P. Here's an idea:

  • Have a master process spawn subprocesses with something similar to what I showed in the link above.

  • Each process starts by sending a byte up to the master to signal it's ready, and blocking on read().

  • Have the master process read a line from the file to a shared memory buffer, and block on select() on its children pipes.

  • When select() returns, read one of the bytes that signal readiness and send to that subprocess the offset of the line in the shared memory space.

  • The master process repeats (reads a line, blocks on select, reads a byte to consume the readiness event, etc.)

  • The children process the line in whatever way you need, then send a byte to the master to signal readiness once again.

(You can avoid the shared memory buffer if you want, and send the lines down the pipes, though it'll involve constant data-copying. If the processing of each line is computationally expensive, it won't really make a difference; but if the lines require little processing, it may slow you down).

I hope this helps!

edit 2 based on Newba's comments:

Okay, so no shared memory. Use the above model, only instead of sending down the pipe the offset of the line in the shared memory space, send the whole line. This may sound like you're wasting time when you could just read it from the file, but trust me, you're not. Pipes are orders of magnitude faster than reads from regular files on a hard disk, and if you had the subprocesses read directly from the file, you'd run into the problem I pointed out at the start of the answer.

So, master process (a rough sketch in C follows the list):

  • Spawn subprocesses using something like the function I wrote (link above) that creates pipes for bidirectional communication.

  • Read a line from the file into a buffer (private, local, no shared memory whatsoever).

  • You now have data ready to be processed. Call select() to block on all the pipes that communicate you with your subprocesses.

  • Choose any of the pipes that have data available, read one byte from it, and then send the line you have waiting to be processed in the buffer down the corresponding pipe (remember, we had 2 pipes per child process, one to go up, one to go down).

  • Repeat from step 2, i.e. read another line.
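
A rough sketch of that master loop, under the assumption that the children have already been spawned and that up[i] / down[i] hold the parent-side ends of each child's two pipes (those array names are made up here; the spawning itself is what the linked answer covers):

    #include <stdio.h>
    #include <string.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* steps 2-5 of the master: read a line, wait for a ready child,
     * consume its readiness byte, send it the line */
    void master_loop(FILE *log, int nchildren, int up[], int down[])
    {
        char line[4096];

        while (fgets(line, sizeof line, log) != NULL) {       /* step 2 */
            fd_set readfds;
            int maxfd = -1;

            FD_ZERO(&readfds);
            for (int i = 0; i < nchildren; i++) {              /* step 3 */
                FD_SET(up[i], &readfds);
                if (up[i] > maxfd)
                    maxfd = up[i];
            }
            if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0) {
                perror("select");
                return;
            }
            for (int i = 0; i < nchildren; i++) {
                if (FD_ISSET(up[i], &readfds)) {               /* step 4 */
                    char ready;
                    read(up[i], &ready, 1);   /* consume the readiness byte */
                    write(down[i], line, strlen(line));  /* ship the line   */
                    break;                    /* step 5: back to reading    */
                }
            }
        }
        /* when the file is exhausted, close the down[] ends so the
         * children see EOF on read() and can exit */
    }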

Child processes (a matching sketch follows this list):

  • When they start, they have a reading pipe and a writing pipe at their disposal. Send a byte down your writing pipe to signal the master process you are ready and waiting for data to process (this is the single byte we read in step 4 above).

  • Block on read(), waiting for the master process (which knows you are ready because of step 1) to send you data to process. Keep reading until you reach a newline (you said you were reading lines, right?). Note I'm following your model, sending a single line to each process at a time; you could send multiple lines if you wanted.

  • Process the data.

  • Return to step 1, i.e. send another byte to signal you are ready for more data.
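
And the matching child loop, same hypothetical setup (read_fd is the master-to-child pipe end, write_fd is the child-to-master one):

    #include <unistd.h>

    void child_loop(int read_fd, int write_fd)
    {
        char line[4096];

        for (;;) {
            char ready = 'R';
            if (write(write_fd, &ready, 1) != 1)   /* step 1: signal readiness */
                break;

            size_t len = 0;                        /* step 2: read one line */
            for (;;) {
                ssize_t n = read(read_fd, &line[len], 1);
                if (n <= 0)
                    return;                        /* EOF: master closed the pipe */
                if (line[len++] == '\n' || len == sizeof line - 1)
                    break;
            }
            line[len] = '\0';

            /* step 3: process the line however you need, then loop back
             * to step 1 and ask for more work */
        }
    }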

There you go, simple protocol to assign tasks to as many subprocesses as you want. It may be interesting to run a test with 1 child, n children (where n is the number of cores in your computer) and more than n children, and compare performances.

Whew, that was a long answer. I really hope I helped xD

Santiago Lezica
Thanks, but I cannot use shared memory yet, just pipes, named pipes and child processes. The select is to be used in the reduce process that waits for the results from the map procs, sums every parcel and then presents it on screen. My difficulty is: how do I use all these limitations to outperform a single-process program? I will not use much more than 2 or 3 procs, but I understand that counting lines and handing them out to each proc is messy and probably not the fastest way; still, it has to work, the assignment has that "challenge" besides learning pipes, select, fork, etc.
neverMind
Well, skip the shared memory part. Use pipes for communication. Instead of copying the data into shared memory, copy it into the master process's own memory and then send it down the pipe. It's a simple protocol, you don't need to define any additional rules. About performance, trust me: the approach I detailed is faster than a single-process approach, and even more so on a multicore computer.
Santiago Lezica
Yes, I have N pipes, one for each map proc. But how do I say "Hey child, you read this line and that line, and the other child reads that line, that line and that line..."? I can send whatever I want in a pipe; I just don't know what to send, that's my problem.
neverMind
Okay, I'm gonna edit the answer, there's little room for explanation in here =P
Santiago Lezica
There you go! Edited.
Santiago Lezica
Just to be clear, we have ONE pipe for each child, so it corresponds to having fd[0] for reading and fd[1] for writing. We write pipe(fd), with int fd[2], which means we have 2 fd's for each pipe and not 2 pipes per child (which would give 4 fd's).
neverMind
You can only spawn one pipe per child? Is that a limitation? =/ well, you CAN do what I said with only one pipe, since the master and children take turns to read and write anyway. Synchronize them carefully!
Santiago Lezica
A: 

Since each of the processes is going to have to read the file in its entirety (unless the log lines are all of the same length, which is unusual), there really isn't a benefit to your proposal 2.

If you are going to split up the work into 3, then I would do the following (a sketch using fseek() appears after the list):

  • Measure (stat()) the size of the log file - call it N bytes.
  • Allocate the range of bytes 0..(N/3) to first child.
  • Allocate the range of bytes (N/3)+1..2(N/3) to the second child.
  • Allocate the range of bytes 2(N/3)+1..end to the third child.
  • Define that the second and third children must synchronize by reading forward to the first line break after their start position.
  • Define that each child is responsible for reading to the first line break on or after the end of their range.
  • Note that the third child (last child) may have to do more work if the log file is growing.

Then the processes are reading independent segments of the file.

(Of course, with them all sharing the file, the system buffer pool saves rereading the disk, but the data is still copied to each of the three processes, only to have each process throw away 2/3 of what was copied as someone else's job.)
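
Here is a hedged sketch of that scheme in C, run sequentially for clarity (in the real program each range would be handled by a fork()ed child). The convention used: a child owns every line that starts inside its byte range, and it reads that line to completion even if it crosses the range's end. Names like parse_line() are placeholders, and lines are assumed to fit in the buffer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    static void parse_line(const char *line) { (void)line; /* your map work here */ }

    static void process_range(const char *path, long start, long end)
    {
        FILE *fp = fopen(path, "r");
        char line[4096];

        if (fp == NULL) { perror(path); exit(EXIT_FAILURE); }

        /* seek just before the range, then discard everything up to and
         * including the first newline: the previous child owns that part */
        fseek(fp, start == 0 ? 0 : start - 1, SEEK_SET);
        if (start != 0)
            fgets(line, sizeof line, fp);

        /* read lines as long as they *start* at or before 'end'; the last
         * one may extend past the end of the range, as defined above */
        while (ftell(fp) <= end && fgets(line, sizeof line, fp) != NULL)
            parse_line(line);

        fclose(fp);
    }

    int main(int argc, char *argv[])
    {
        struct stat st;
        int nchildren = 3;

        if (argc < 2 || stat(argv[1], &st) != 0) {
            fprintf(stderr, "usage: %s logfile\n", argv[0]);
            return EXIT_FAILURE;
        }
        for (int i = 0; i < nchildren; i++) {
            long start = (long)(st.st_size * i / nchildren);
            long end   = (long)(st.st_size * (i + 1) / nchildren) - 1;
            /* in the real program: fork() here, call process_range() in
             * the child, and have the parent wait() for them all */
            process_range(argv[1], start, end);
        }
        return EXIT_SUCCESS;
    }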


Another, more radical option:

  • mmap() the log file into memory.
  • Assign the children to different segments of the file along the lines described previously.

If you're on a 64-bit machine, this works pretty well. If your log files are not too massive (say 1 GB or less), you can do it on a 32-bit machine too. As the file size grows above 1 GB or so, you may start running into memory mapping and allocation issues, though you might well get away with it until you reach a size somewhat less than 4 GB (on a 32-bit machine). The other issue here is with growing log files. AFAIK, mmap() doesn't map extra memory as extra data is written to the file.
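
A corresponding mmap() sketch, using the same "a child owns the lines that start inside its range" convention (scan_range() and the sequential loop stand in for the per-child work and the fork() calls):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void scan_range(const char *base, size_t lo, size_t hi, size_t total)
    {
        size_t pos = lo;

        /* unless we start at offset 0, skip the tail of the line owned
         * by the previous child */
        if (lo != 0) {
            const char *nl = memchr(base + lo - 1, '\n', total - (lo - 1));
            if (nl == NULL)
                return;                     /* no line starts in this range */
            pos = (size_t)(nl - base) + 1;
        }
        while (pos < hi && pos < total) {
            const char *nl = memchr(base + pos, '\n', total - pos);
            size_t len = nl ? (size_t)(nl - (base + pos)) + 1 : total - pos;
            /* base + pos .. base + pos + len is one whole line: parse it here */
            pos += len;
        }
    }

    int main(int argc, char *argv[])
    {
        struct stat st;
        int fd;
        char *base;
        int nchildren = 3;

        if (argc < 2) { fprintf(stderr, "usage: %s logfile\n", argv[0]); return 1; }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

        base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        for (int i = 0; i < nchildren; i++) {
            size_t lo = (size_t)st.st_size * i / nchildren;
            size_t hi = (size_t)st.st_size * (i + 1) / nchildren;
            /* fork() here in the real program; the mapping is inherited */
            scan_range(base, lo, hi, (size_t)st.st_size);
        }
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }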

Jonathan Leffler
A: 

Use a master and slave queue pattern (a rough sketch follows the list).

  • The master sets up the slaves, which sit waiting on a queue for work items.
  • The master then reads the file line by line.
  • Each line then represents a work item that you put on the queue, together with a function pointer describing how to do the work.
  • One of the waiting slaves then takes the item off the queue.
  • The slave processes the work item.
  • When a slave has finished, it rejoins the work queue.
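
The pattern itself doesn't care whether the slaves are threads or processes; for brevity this sketch uses POSIX threads and a mutex-protected queue (with child processes the pipes themselves would play the role of the queue, as in the first answer). The work_item and do_line names are made up for illustration:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct work_item {
        void (*fn)(char *);          /* function pointer: how to do the work */
        char *line;                  /* the data: one line from the log */
        struct work_item *next;
    };

    static struct work_item *queue_head = NULL;
    static int done = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    static void do_line(char *line) { printf("parsed: %s", line); }

    static void *slave(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (queue_head == NULL && !done)
                pthread_cond_wait(&nonempty, &lock);   /* wait on the queue */
            struct work_item *item = queue_head;
            if (item != NULL)
                queue_head = item->next;
            pthread_mutex_unlock(&lock);

            if (item == NULL)
                return NULL;             /* queue drained and no more input */
            item->fn(item->line);        /* process the work item */
            free(item->line);
            free(item);
        }
    }

    int main(int argc, char *argv[])
    {
        enum { NSLAVES = 3 };
        pthread_t slaves[NSLAVES];
        char line[4096];
        FILE *fp = (argc > 1) ? fopen(argv[1], "r") : stdin;

        if (fp == NULL) { perror("fopen"); return EXIT_FAILURE; }

        for (int i = 0; i < NSLAVES; i++)          /* master sets up the slaves */
            pthread_create(&slaves[i], NULL, slave, NULL);

        while (fgets(line, sizeof line, fp) != NULL) {   /* read line by line */
            struct work_item *item = malloc(sizeof *item);
            item->fn = do_line;
            item->line = strdup(line);
            pthread_mutex_lock(&lock);
            item->next = queue_head;               /* LIFO queue, for brevity */
            queue_head = item;
            pthread_cond_signal(&nonempty);
            pthread_mutex_unlock(&lock);
        }

        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_broadcast(&nonempty);         /* let idle slaves exit */
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < NSLAVES; i++)
            pthread_join(slaves[i], NULL);
        return EXIT_SUCCESS;
    }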
Martin York