views: 216
answers: 4

This is how the situation looks:

I have a text file with a number of lines, and I am looking for a string in this file. I need to pass the following command-line parameters to the program:
- file path
- the string I am looking for
- the maximum number of processes the program is allowed to "fork" in order to complete this task.

How should such a program be constructed?

Thanks for your help.

+3  A: 

A couple of thoughts.

  • You will have to open the file separately in each process; otherwise they will share a single file descriptor and thus a shared position in the file (or not; see the comments, as this may be system-specific...).
  • You may not see the speed increase you are hoping for due to disk access and/or cache miss patterns.

You might be able to sidestep both issues by memory-mapping the file (though you still risk an increased cache-miss rate)...
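For illustration only, here is a rough, untested sketch of the mmap route: the parent maps the file read-only and forks the requested number of children, and each child scans its own slice with a small overlap so a match straddling a chunk boundary isn't missed. All names and the (minimal) error handling are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>

/* Usage: ./psearch <file> <string> <nprocs>   (names are illustrative) */
int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s file string nprocs\n", argv[0]);
        return 2;
    }
    const char *needle = argv[2];
    int nprocs = atoi(argv[3]);
    if (nprocs < 1) nprocs = 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* One shared read-only mapping; every child sees the same pages. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    size_t chunk = (st.st_size + nprocs - 1) / nprocs;
    size_t nlen = strlen(needle);

    for (int i = 0; i < nprocs; i++) {
        if (fork() == 0) {
            size_t start = (size_t)i * chunk;
            size_t end = start + chunk + nlen;   /* overlap so boundary matches aren't missed */
            if (end > (size_t)st.st_size) end = st.st_size;
            for (size_t p = start; p + nlen <= end; p++) {
                if (memcmp(data + p, needle, nlen) == 0) {
                    printf("match at byte offset %zu\n", p);
                    _exit(0);                    /* found: report and stop this child */
                }
            }
            _exit(1);                            /* this child's chunk had no match */
        }
    }

    int found = 1;
    for (int i = 0; i < nprocs; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            found = 0;
    }
    munmap(data, st.st_size);
    close(fd);
    return found;    /* 0 if any child found the string */
}
```

A read-only MAP_SHARED mapping is inherited across fork(), so no per-child open() or lseek() bookkeeping is needed.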


How badly do you need this? It runs a real risk of being premature optimization. I would recommend against touching the problem without a compelling need. Really.

dmckee
Optionally, instead of opening many file descriptors or mmapping, you can use `pread` to read from a file at many offsets concurrently.
ephemient
**man pread** Hey! You learn something every day. Thanks.
dmckee
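As a point of reference, pread() takes an explicit offset and does not move the descriptor's file position, so several workers can read different regions through one descriptor. A minimal sketch (the file name and offset are placeholders):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("haystack.txt", O_RDONLY);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    /* pread() takes the offset as an argument and never touches the fd's
       file position, so concurrent readers cannot trip over each other. */
    ssize_t n = pread(fd, buf, sizeof buf, 8192);
    if (n < 0) { perror("pread"); return 1; }

    printf("read %zd bytes starting at offset 8192\n", n);
    close(fd);
    return 0;
}
```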
When you call fork(), it clones the file descriptors, so a child process will NOT share a file position with its parent, unlike threads, which use the same descriptors.
MarkR
@MarkR: From the fork man page on my Mac OS 10.5 system: "an lseek(2) on a descriptor in the child process can affect a subsequent read or write by the parent" (because both FILE*'s point to the same OS structure). Could this be OS dependent?
dmckee
The manpage on my Debian Etch system is not specific on the matter.
dmckee
Hmm, OK, maybe it is system-specific whether seeks affect the parent's/child's/sibling's file descriptor. Closing one certainly doesn't affect another.
MarkR
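For anyone who wants to settle this on their own system, a small test program makes the behavior visible: the child seeks on the inherited descriptor, and the parent then reports its own position (untested sketch; the default path is just a placeholder).

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    /* Any readable file will do; the default path is just a placeholder. */
    const char *path = (argc > 1) ? argv[1] : "/etc/hosts";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        lseek(fd, 100, SEEK_SET);   /* child moves the (possibly shared) offset */
        _exit(0);
    }

    wait(NULL);                     /* make sure the child's lseek has happened */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    /* If this prints 100, parent and child share one file offset. */
    printf("parent offset after child's lseek: %lld\n", (long long)pos);
    close(fd);
    return 0;
}
```

POSIX has the child's descriptors refer to the same open file descriptions as the parent's, so on conforming systems the printed offset should be 100; closing one copy, as noted above, does not affect the other.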
+2  A: 

Consider why you think you need to parallelize this, and if you're going to see any actual performance benefit. You're likely to be limited by disk access time, and there's overhead to forking. Your best option might be to do a standard single-threaded search (probably with a regex).
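For a literal string (no regex needed), the single-threaded version is only a handful of lines. A rough sketch with illustrative names, reading line by line and using strstr():

```c
#include <stdio.h>
#include <string.h>

/* Minimal single-pass, line-by-line search for a fixed string (not a regex). */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file string\n", argv[0]);
        return 2;
    }
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[4096];
    long lineno = 0;
    int found = 0;
    while (fgets(line, sizeof line, fp)) {
        lineno++;
        if (strstr(line, argv[2])) {
            printf("%ld: %s", lineno, line);   /* line already ends in '\n' */
            found = 1;
        }
    }
    fclose(fp);
    return found ? 0 : 1;
}
```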

Nick
Regex searching is actually pretty slow... there are so many options, unless you use a DFA, which is fast but complex. The really fast way to do it is the jump-table trick, where you read ahead by the length of the string and backtrack by a certain amount if the character you land on actually appears in the pattern.
Mark Santesson
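The "jump table trick" described above sounds like the Boyer-Moore-Horspool bad-character skip. A compact, untested sketch for a fixed (non-regex) pattern:

```c
#include <stddef.h>
#include <limits.h>

/* Horspool bad-character search: returns a pointer to the first match, or NULL. */
const char *horspool(const char *text, size_t tlen,
                     const char *pat, size_t plen)
{
    size_t skip[UCHAR_MAX + 1];
    if (plen == 0 || plen > tlen)
        return plen == 0 ? text : NULL;

    /* Default shift is the full pattern length. */
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        skip[i] = plen;
    /* Characters occurring in the pattern (except its last byte) shift less. */
    for (size_t i = 0; i + 1 < plen; i++)
        skip[(unsigned char)pat[i]] = plen - 1 - i;

    size_t pos = 0;
    while (pos + plen <= tlen) {
        /* Compare the window right to left. */
        size_t j = plen;
        while (j > 0 && text[pos + j - 1] == pat[j - 1])
            j--;
        if (j == 0)
            return text + pos;
        /* Shift by the skip value of the last character in the window. */
        pos += skip[(unsigned char)text[pos + plen - 1]];
    }
    return NULL;
}
```

That said, a decent libc strstr() (or memmem(), where available) is usually already well tuned, so it is worth measuring before hand-rolling a search.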
+1  A: 
RaphaelSP
+1  A: 

Either this is homework, or it is useless. The bottleneck is disk bandwidth, not CPU power. You will only slow things down with simultaneous accesses.

rlerallut
It is probably homework, but if he were looking for a regex, some of those can be computationally expensive. But he isn't looking for a regex, so I don't know why I bother mentioning it...
Mark Santesson
I think that parallelizing regex matching is *tough*, but I'm not an expert...
rlerallut