views: 216
answers: 4

This is how the situation looks:

I have a text file with a number of lines, and I am looking for a string in this file. I need to pass the following command-line parameters to the program:
- file path
- the string I am looking for
- the maximum number of processes the program is allowed to "fork" in order to complete this task.

How should such a program be constructed?

Thanks for your help.

+3  A: 

A couple of thoughts.

  • You will have to open the file separately in each process; otherwise they will share a single file descriptor and thus a shared position in the file (or not; see the comments, as this may be system-specific...).
  • You may not see the speed increase you are hoping for due to disk access and/or cache miss patterns.

You might be able to sidestep both issues by memory-mapping the file (though you still risk an increased cache-miss rate)...
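For illustration only, here is a rough, untested sketch of the mmap route: the parent maps the file read-only and forks the requested number of children, and each child scans its own slice with a small overlap so a match straddling a chunk boundary isn't missed. All names and the (minimal) error handling are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>

/* Usage: ./psearch <file> <string> <nprocs>   (names are illustrative) */
int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s file string nprocs\n", argv[0]);
        return 2;
    }
    const char *needle = argv[2];
    int nprocs = atoi(argv[3]);
    if (nprocs < 1) nprocs = 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* One shared read-only mapping; every child sees the same pages. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    size_t chunk = (st.st_size + nprocs - 1) / nprocs;
    size_t nlen = strlen(needle);

    for (int i = 0; i < nprocs; i++) {
        if (fork() == 0) {
            size_t start = (size_t)i * chunk;
            size_t end = start + chunk + nlen;   /* overlap so boundary matches aren't missed */
            if (end > (size_t)st.st_size) end = st.st_size;
            for (size_t p = start; p + nlen <= end; p++) {
                if (memcmp(data + p, needle, nlen) == 0) {
                    printf("match at byte offset %zu\n", p);
                    _exit(0);                    /* found: report and stop this child */
                }
            }
            _exit(1);                            /* this child's chunk had no match */
        }
    }

    int found = 1;
    for (int i = 0; i < nprocs; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            found = 0;
    }
    munmap(data, st.st_size);
    close(fd);
    return found;    /* 0 if any child found the string */
}
```

A read-only MAP_SHARED mapping is inherited across fork(), so no per-child open() or lseek() bookkeeping is needed.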


How badly do you need this? It runs a real risk of being premature optimization. I would recommend against touching the problem without a compelling need. Really.

dmckee
Optionally, instead of opening many file descriptors or mmapping, you can use `pread` to read from a file at many offsets concurrently.
ephemient
**man pread** Hey! You learn something every day. Thanks.
dmckee
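As a point of reference, pread() takes an explicit offset and does not move the descriptor's file position, so several workers can read different regions through one descriptor. A minimal sketch (the file name and offset are placeholders):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("haystack.txt", O_RDONLY);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    /* pread() takes the offset as an argument and never touches the fd's
       file position, so concurrent readers cannot trip over each other. */
    ssize_t n = pread(fd, buf, sizeof buf, 8192);
    if (n < 0) { perror("pread"); return 1; }

    printf("read %zd bytes starting at offset 8192\n", n);
    close(fd);
    return 0;
}
```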
When you call fork(), it clones the file descriptors, so a child process will NOT share a file position with its parent, unlike threads, which use the same descriptors.
MarkR
@MarkR: From the fork man page on my Mac OS 10.5 system: "an lseek(2) on a descriptor in the child process can affect a subsequent read or write by the parent" (because both FILE*'s point to the same OS structure). Could this be OS dependent?
dmckee
The manpage on my Debian Etch system is not specific on the matter.
dmckee
Hmm, OK, maybe it is system-specific whether seeks affect the parent's/child's/sibling's file descriptor. Closing one certainly doesn't affect another.
MarkR
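For anyone who wants to settle this on their own system, a small test program makes the behavior visible: the child seeks on the inherited descriptor, and the parent then reports its own position (untested sketch; the default path is just a placeholder).

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    /* Any readable file will do; the default path is just a placeholder. */
    const char *path = (argc > 1) ? argv[1] : "/etc/hosts";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        lseek(fd, 100, SEEK_SET);   /* child moves the (possibly shared) offset */
        _exit(0);
    }

    wait(NULL);                     /* make sure the child's lseek has happened */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    /* If this prints 100, parent and child share one file offset. */
    printf("parent offset after child's lseek: %lld\n", (long long)pos);
    close(fd);
    return 0;
}
```

POSIX has the child's descriptors refer to the same open file descriptions as the parent's, so on conforming systems the printed offset should be 100; closing one copy, as noted above, does not affect the other.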
+2  A: 

Consider why you think you need to parallelize this, and if you're going to see any actual performance benefit. You're likely to be limited by disk access time, and there's overhead to forking. Your best option might be to do a standard single-threaded search (probably with a regex).
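For a literal string (no regex needed), the single-threaded version is only a handful of lines. A rough sketch with illustrative names, reading line by line and using strstr():

```c
#include <stdio.h>
#include <string.h>

/* Minimal single-pass, line-by-line search for a fixed string (not a regex). */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file string\n", argv[0]);
        return 2;
    }
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[4096];
    long lineno = 0;
    int found = 0;
    while (fgets(line, sizeof line, fp)) {
        lineno++;
        if (strstr(line, argv[2])) {
            printf("%ld: %s", lineno, line);   /* line already ends in '\n' */
            found = 1;
        }
    }
    fclose(fp);
    return found ? 0 : 1;
}
```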

Nick
Regex searching is actually pretty slow... there are so many options, unless you use a DFA, which is fast but complex. The really fast way to do it is the jump-table trick, where you read ahead by the length of the string and backtrack by a certain amount if the character you land on actually appears in the pattern.
Mark Santesson
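The "jump table trick" described above sounds like the Boyer-Moore-Horspool bad-character skip. A compact, untested sketch for a fixed (non-regex) pattern:

```c
#include <stddef.h>
#include <limits.h>

/* Horspool bad-character search: returns a pointer to the first match, or NULL. */
const char *horspool(const char *text, size_t tlen,
                     const char *pat, size_t plen)
{
    size_t skip[UCHAR_MAX + 1];
    if (plen == 0 || plen > tlen)
        return plen == 0 ? text : NULL;

    /* Default shift is the full pattern length. */
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        skip[i] = plen;
    /* Characters occurring in the pattern (except its last byte) shift less. */
    for (size_t i = 0; i + 1 < plen; i++)
        skip[(unsigned char)pat[i]] = plen - 1 - i;

    size_t pos = 0;
    while (pos + plen <= tlen) {
        /* Compare the window right to left. */
        size_t j = plen;
        while (j > 0 && text[pos + j - 1] == pat[j - 1])
            j--;
        if (j == 0)
            return text + pos;
        /* Shift by the skip value of the last character in the window. */
        pos += skip[(unsigned char)text[pos + plen - 1]];
    }
    return NULL;
}
```

That said, a decent libc strstr() (or memmem(), where available) is usually already well tuned, so it is worth measuring before hand-rolling a search.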
+1  A: 
RaphaelSP
+1  A: 

Either this is homework, or it is useless. The bottleneck is disk bandwidth, not CPU power. You will only slow things down with simultaneous accesses.

rlerallut
It is probably homework, but if he were looking for a regex, some of those can be computationally expensive. But he isn't looking for a regex, so I don't know why I bother mentioning it...
Mark Santesson
I think that parallelizing regex matching is *tough*, but I'm not an expert...
rlerallut