I am running an experiment distributed over several computers using the Parallel Computing Toolbox. I want to log the progress of the experiment (and any errors that occur) to a file while the processes are running. What is the standard way to do it?

EDIT:

  1. My jobs are embarrassingly parallel
  2. I want only one file for all the workers (I have a network drive that can be accessed from all the machines)

My main concern is having a file opened for append by several workers. Do I risk losing messages, or having an error opening the file?

+2  A: 

Assuming that your work is embarrassingly parallel (that is, organized as independent jobs and tasks) and that you want the log file updated at the end of each task, I would use the taskFinish callback:

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/taskfinish.html

You could also just write to a file somehow in the middle of your task as you would in MATLAB normally, but I think you are asking about callbacks at the end of the task.
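For the second option, a minimal sketch of appending one line at a time from inside a task might look like the following. The helper name appendLogLine and its arguments are hypothetical, not part of the toolbox. The file is opened and closed around each message so no worker keeps it open for long, but nothing here guards against two workers writing at exactly the same moment.

```matlab
function appendLogLine(logFile, msg)
% appendLogLine  Append one timestamped line to a text log (hypothetical helper).
%   logFile - path to the log file, e.g. on the shared network drive
%   msg     - character array holding the message to record

[fid, errMsg] = fopen(logFile, 'at');   % open text file for appending
if fid == -1
    error('appendLogLine:open', 'Could not open %s: %s', logFile, errMsg);
end

% Write timestamp and message with a single fprintf call, then close right away
stamp = datestr(now, 'yyyy-mm-dd HH:MM:SS');
fprintf(fid, '%s  %s\n', stamp, msg);
fclose(fid);
end
```

It could be called at any point in the task function, for example appendLogLine('log.txt', 'step 1 done'), or from within a catch block to record an error message.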

MatlabDoug
+1  A: 

When multiple processes output to a single file, you could run into some potential problems, like messages being overwritten or intermingled. I've had this happen with programs in other languages (like C), and I assume the same problem could arise in MATLAB, but I freely admit I could be wrong about this. Assuming I'm not wrong...

If you want to reliably output data from multiple worker processes to a single log file while the processes are running, one way to do this is to make one process be responsible for all the file operations (i.e. a "master" process). The "master" process would collect messages from the other workers (i.e. "slaves") and output this data to the log file.

Since I don't know what specifically you are having each process do, it's hard to suggest specific code changes to make. Here are some steps and sample code for how you might do this in MATLAB. These code samples assume you are running the same function (process_fcn) on each process:

  • The "master" process first has to open the file. This code (using the labindex function) should be run at the beginning of process_fcn:

    if (labindex == 1),
      fid = fopen('log.txt','at');  % Open text file for appending
    end
    
  • While each process is running, you can collect any data that needs to be output to the log file in a variable called data, which stores a string or character array. This data could be error messages captured within a try-catch block or any other data that you would want to be in the log file.

  • At periodic points in process_fcn (either when major tasks are completed or within a loop of computation), you would have to have each process check for data that needs to be output (i.e. data is not empty) and have that data sent to the "master" process. The "master" process would then collect and print these messages from other processes, along with any of its own. Here's a sample of how this might be done (using the functions labBarrier, labProbe, labSend, and labReceive):

    labBarrier;  % All processes are synchronized here
    if (labindex == 1),  % This is done by the "master"
      if ~isempty(data),
        fprintf(fid,'%s\n',data);  % Print "master" data
      end
      pause(1);  % Wait a moment for "slaves" to send messages
      while labProbe,  % Loop while messages are available
        data = labReceive;  % Get data from "slaves"
        fprintf(fid,'%s\n',data);
      end
    else  % This is done by the "slaves"
      if ~isempty(data),
        labSend(data,1);  % Send data to the "master"
      end
    end
    data = '';  % Clear data
    

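    The call to pause is there to give each "slave" process time to call labSend before the "master" starts checking for sent messages with labProbe. A fixed one-second delay is a heuristic rather than a guarantee, so a message that arrives late would only be picked up at the next collection point.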

  • Finally, the "master" process has to close the file. This code should be run at the end of process_fcn:

    if (labindex == 1),
      fclose(fid);
    end
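As a variation on the collection step above, here is a sketch that avoids the pause and labProbe polling by having every "slave" always send exactly one message (possibly empty) and the "master" receive exactly one from each. The function name logFromAllLabs and the "lab N:" prefix are made up for illustration, and the sketch assumes every process calls it at the same point and that labSend accepts an empty character array:

```matlab
function logFromAllLabs(logFile, data)
% logFromAllLabs  Collect one message per lab and write them from lab 1.
%   Hypothetical helper; every lab must call it the same number of times.
%   logFile - path to the log file (only used on lab 1)
%   data    - this lab's pending log text ('' if there is nothing to report)

if labindex == 1                       % the "master" does all file I/O
    fid = fopen(logFile, 'at');        % open text file for appending
    if fid == -1
        error('logFromAllLabs:open', 'Could not open %s', logFile);
    end
    if ~isempty(data)
        fprintf(fid, 'lab 1: %s\n', data);
    end
    for src = 2:numlabs
        msg = labReceive(src);         % waits for the message from lab src
        if ~isempty(msg)
            fprintf(fid, 'lab %d: %s\n', src, msg);
        end
    end
    fclose(fid);
else                                   % the "slaves" always send once
    labSend(data, 1);
end
end
```

Because the "master" waits for each "slave" in turn, this still synchronizes the processes at every collection point, so it suits loops where all processes reach the call at roughly the same time.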
    
gnovice
I'm not sure about the labBarrier function (because my jobs don't have the same running time). But this gives me a good starting point.
Eolmar
If your jobs have vastly different running times then synchronizing them for outputting data gets tricky. I'm unsure if labSend is a *blocking* call (process waits for a matching receive) or *non-blocking* call (message is buffered and the process moves on, whether a receive is posted or not). My above code assumes it's blocking... if it's non-blocking it would probably simplify things for you (i.e. no labBarrier or PAUSE would be needed).
gnovice
