views:

141

answers:

3

I have a process master that spawns N child processes that communicate with the parent through unnamed pipes. I must be able to:

  • make the father open the file and then send, to each child, a struct telling that it has to read from min to max line;
  • this is going to happen at the same time, so I don't know:
  • 1st how to divide total_lines for N maps and
  • 2nd how do I make each child read just the lines it is supposed to?

My problem does not concern the O.S. concepts, only the file operations :S

Perhaps fseek? I can't mmap the log file (some have more than 1GB).

I would appreciate some ideas. Thank you in advance

EDIT: I'm trying to make the children read the respective lines without using fseek and the value of chunks, so, could someone please tell me if this is valid? :

//somewhere in the parent process:

    FILE* logFile = fopen(filename, "r");
                while (fgets(line, 1024, logFile) != NULL) {
                    num_lines++;
                }
                rewind(logFile);
                int prev = 0;
                for (i = 0; i < maps_nr; i++) {
                    struct send_to_Map request;
                    request.fp = logFile;
                    request.lower = lowLimit;
                    request.upper = highLimit;

                    if (i == 0)
                        request.minLine = 0;
                    else
                        request.minLine = 1 + prev;
                    if(i!=maps_nr-1)
                        request.maxLine = (request.minLine + num_lines / maps_nr) - 1;
                    else
                       request.maxLine = (request.minLine + num_lines / maps_nr)+(num_lines%maps_nr);
                    prev = request.maxLine;

                }
                //write this structure to respective pipe


//child process:

while(1) {
      ...
      //reads the structure to pipe (and knows which lines to read)
      int n=0, counter=0;
      while (fgets(line, 1024, logFile) != NULL){
         if (n>=minLine and n<=maxLine)
              counter+= process(Line);//returns 1 if IP was found, in that line, between the low and high limit
         n++; 
      }
     //(...) 
}

I don't know if it's going to work, I just to make it work! Even like this, is it possible to outperform a single process reading the whole file and printing the total number of ips found in the log file?

+1  A: 

If you don't care about dividing the file exactly evenly, and the distribution of line lengths is somewhat even over the entire file, you can avoid reading the entire file in the parent once.

  1. Get the file size.
  2. chunk_size = file_size / number_of_children
  3. When you spawn each child do in the parent:
    • seek to (child_num+1) * chunk_size
    • Read forward until you find a newline.
    • Spawn the child, telling it to start at the end of the previous chunk (or 0 for the first child), and the actual length of the chunk.
  4. Each child seeks to start and reads chunk_size bytes.

That's a rough sketch of the strategy.

Edited to simplify things a bit.

Edit: here's some untested code for step 3, and step 4 below. This is all untested, and I haven't been careful about off-by-one errors, but it gives you an idea of the usage of fseek and ftell, which sounds like what you are looking for.

// Assume FILE* f is open to the file, chunk_size is the average expected size,
// child_num is the id of the current child, spawn_child() is a function that
// handles the logic of spawning a child and telling it where to start reading,
// and how much to read. child_chunks[] is an array of structs to keep track of
// where the chunks start and how big they are.
if(fseek(f, child_num * chunk_size, SEEK_SET) < 0) { handle_error(); }
int ch;
while((ch = fgetc(f)) != FEOF && ch != '\n')
{/*empty*/}

// FIXME: needs to handle EOF properly.
child_chunks[child_num].end = ftell(f); // FIXME: needs error check.
child_chunks[child_num+1].start = child_chunks[child_num].end + 1;
spawn_child(child_num);

Then in your child (step 4), assume the child has access to child_chunks[] and knows its child_num:

void this_is_the_child(int child_num)
{
    /* ... */

    fseek(f, child_chunks[child_num].start, SEEK_SET); // FIXME: handle error
    while(fgets(...) && ftell(f) < child_chunks[child_num].end)
    {
    }
}
bstpierre
The problem is that each line has very different sizes, but max is 1024 bytes. The children have to read an IP from each line, and if I estimate wrong, they may reach a IP at his middle, or something and that cannot happen of course
neverMind
The second bullet in step 5 prevents you from having the estimation break a line in half -- read forward until you find a newline. If the distribution of line sizes is fairly uniform, you could skip estimating the expected line size and just divide the file length by the number of children.
bstpierre
Can you please exemplify wiht some code ? My doubt doesn't concern the operating systems concepts, just the work with the file :(
neverMind
+1  A: 
/* get an array with line-startpositions (file-offsets) */
fpos_t readLineBegins(FILE *f,fpos_t **begins)
{
  fpos_t ch=0, mark=0, num=0;
  *begins = 0;
  do {
    if( ch=='\n' )
    {
       *begins = realloc( *begins, ++num * sizeof(fpos_t) );
      (*begins)[num-1] = mark;
        mark = ftell(f);
    }
  } while( (ch=fgetc(f))!=EOF );

  if( mark<ftell(f) )
  {
    *begins = realloc( *begins, ++num * sizeof(fpos_t) );
    (*begins)[num-1]=mark;
  }

  return num;
}

/* output linenumber beg...end */
void workLineBlocks(FILE *f,fpos_t *begins,fpos_t beg,fpos_t end)
{
  while( beg<=end )
  {
    int ch;
    fsetpos( f, &begins[beg] ); /* set linestart-position */
    printf("%ld:", ++beg );
    while( (ch=fgetc(f))!=EOF && ch!='\n' && ch!='\r' )
      putchar(ch);
    puts("");
  }
}

main()
{
  FILE *f=fopen("file.txt","rb");
  fpos_t *lineBegins, /* Array with line-startpositions */
  lb = readLineBegins(f,&lineBegins); /* get number of lines */

  workLineBlocks(f,lineBegins,lb-2,lb-1); /* out last two lines */
  workLineBlocks(f,lineBegins,0,1); /* out first two lines */

  fclose(f);
  free(lineBegins);
}
Compiling this code, with the respective includes, gives a lot of errors. Could you tell me if you've tested it? Thanks.
neverMind
my MinGW-GCC dont have any errors, replace fpos_t with long like http://codepad.org/g3OpxZoP
here another better solution without compiler-errors: http://www.ideone.com/trPG4
THis code is great, but instead of putting to the stdout, I would prefer to have a string that stores each line. Could you tell me how to do that?I'm sick of trying, but without success :(
neverMind
try and be happy: http://ideone.com/8ySsH
Tks, I've tried it and I've modified to return a struct or a char* instead of just printing each line. Do you now if this is faster then counting the total lines with fgets and then tell beg_line and end_line to each son and make each one of them do a while(fgets).... like user476452 tells above?
neverMind
fgets is a bad idea here, because fgets must have the max. linelength (with '\n') as second parameter. Therefore you must know the max. linelength before you open a file! Try it out.
I know that, but I know my max size is 1024, so when I use fgets and count the lines is much faster then this code (I've really measured it- about less 6seconds with a 1GB file).It works and I've adapted to return a char* and it worked but too slowly for what I want. Nevertheless, I learned. Thanks for your share :)
neverMind
Do you know setvbuf? Its C89, try setvbuf(fp, NULL, _IOLBF, 1024); and your fgetc works buffered like fgets.
+1  A: 

i think it can help you. http://migre.me/1A0g9

John Doe
Thanks, I already have coded something like you have here! I count the lines and then I divide min-max range to each child to read (each child opens the file). It may not be the fastest algorithm to do that (perhaps fseek and pointers), but is easier for me to understand and debug.
neverMind
How can I tell the time?When I do start = clock (); within the parent, start the countdown clock,but the father goes to sleep when the child is working and the clock pauses when the parent process is S (sleeping). How can I make the clock does not stop counting?thanks
John Doe