Let's say I have a loop in bash:

for foo in `some-command`
do
   do-something $foo
done

do-something is CPU-bound and I have a nice shiny 4-core processor. I'd like to be able to run up to 4 do-somethings at once.

The naive approach seems to be:

for foo in `some-command`
do
   do-something $foo &
done

This will run all the do-somethings at once, but there are a couple of downsides. First, do-something may also do significant I/O, and performing all of it at once might slow things down. Second, this code block returns immediately, so there is no way to do other work once all the do-somethings have finished.

How would you write this loop so there are always X do-somethings running at once?

+6  A: 

Instead of plain bash, use a Makefile, then specify the number of simultaneous jobs with make -jX, where X is the number of jobs to run at once.

Or you can use wait ("man wait"): launch several child processes, then call wait; it returns when the child processes finish.

maxjobs = 10

foreach line in `cat file.txt` {
 jobsrunning = 0
 while jobsrunning < maxjobs {
  do job &
  jobsrunning += 1
 }
wait
}

job ( ){
...
}

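If you need a job's result, note that a backgrounded job runs in a subshell, so a variable it assigns is not visible to the parent shell. Have the job write its result to a file, or save its PID from $! and collect its exit status with wait followed by that PID.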

skolima
Thanks for this, even though the code is not finished it's given me the answer to a problem I'm having at work.
gerikson
+4  A: 

Maybe try a parallelizing utility instead of rewriting the loop? I'm a big fan of xjobs. I use xjobs all the time to mass copy files across our network, usually when setting up a new database server. http://www.maier-komor.de/xjobs.html

tessein
+5  A: 
bstark
Realize there's some **serious** underquoting going on here so any jobs that require spaces in arguments will fail badly; moreover, this script will eat your CPU alive while it's waiting for some jobs to finish if more jobs are requested than maxjobs allows for.
lhunath
Also note that this assumes your script isn't doing anything else whatsoever to do with jobs; if you are, it'll count those toward maxjobs as well.
lhunath
You might want to use "jobs -pr" to limit to running jobs.
amphetamachine
+1  A: 

@skolima Your example doesn't really work at all. I know it's meant to be pseudo-code, but I think a working example of using wait would be better. Things like jobsrunning += 1 really need a concrete example of how you would do that in bash. Also, I think the basic logic is wrong: it doesn't connect to the $line variable at all, it seems it would launch $maxjobs jobs for each line, and it waits for all jobs to return before moving on to the next line.
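
For what it's worth, here is a minimal sketch of what a working batch-and-wait version might look like. The name run_batched and its arguments are made up for illustration; it assumes the items arrive one per line on stdin and that the script has no other background jobs, since a bare wait waits for all of them.

```shell
# run_batched MAX CMD [ARGS...]: hypothetical helper, not from the thread.
# Reads one item per line on stdin and runs CMD once per item,
# at most MAX jobs per batch.
run_batched() {
    local max_jobs=$1 running=0 line
    shift
    while IFS= read -r line; do
        "$@" "$line" </dev/null &     # quoted, so spaces in an item survive
        running=$(( running + 1 ))
        if (( running >= max_jobs )); then
            wait                      # block until the whole batch has finished
            running=0
        fi
    done
    wait                              # wait for the final, partial batch
}
```

Usage would be something like: some-command | run_batched 4 do-something. It still waits for a whole batch before starting the next one, so a single slow job holds up the rest of its batch.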

thelsdj
+1  A: 

The project I work on uses the wait command to control parallel shell (ksh actually) processes. To address your concerns about I/O: on a modern OS, it's possible that parallel execution will actually increase efficiency. If all processes are reading the same blocks on disk, only the first process will have to hit the physical hardware. The other processes will often be able to retrieve the block from the OS's disk cache in memory. Reading from memory is several orders of magnitude quicker than reading from disk, and the benefit requires no coding changes.

Jon Ericson
+1  A: 

Here is an alternative solution that can be put in .bashrc and used for everyday one-liners:

function pwait() {
    while [ $(jobs -p | wc -l) -ge $1 ]; do
        sleep 1
    done
}

To use it, all one has to do is put & after the jobs and a pwait call, the parameter gives the number of parallel processes:

for i in *; do
    do_something "$i" &
    pwait 10
done

It would be nicer to use wait instead of polling the output of jobs -p, but there doesn't seem to be an obvious way to wait until any one of the given jobs has finished instead of all of them.
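
As a possible alternative, assuming bash 4.3 or newer: that release added wait -n, which returns as soon as any one background job exits. A sketch of a drop-in variant (pwait_n is a made-up name) could look like this:

```shell
# pwait_n MAX: block while at least MAX background jobs are still running.
# Assumes bash 4.3 or newer for `wait -n`; older versions lack that option.
pwait_n() {
    # jobs -pr lists only the PIDs of jobs that are still running
    while (( $(jobs -pr | wc -l) >= $1 )); do
        wait -n        # sleep until any one job exits, no polling interval
    done
}
```

It is used the same way as pwait above (pwait_n 10 after each backgrounded job) and, like pwait, it counts every background job of the shell, not just the ones started by the loop.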

Grumbel
+2  A: 

While doing this right in bash is probably impossible, you can get it semi-right fairly easily. bstark gave a fair approximation of right, but his has the following flaws:

  • Word splitting: You can't pass any jobs to it that use any of the following characters in their arguments: spaces, tabs, newlines, stars, question marks. If you do, things will break, possibly unexpectedly.
  • It relies on the rest of your script to not background anything. If you do, or later you add something to the script that gets sent in the background because you forgot you weren't allowed to use backgrounded jobs because of his snippet, things will break.

Another approximation which doesn't have these flaws is the following:

scheduleAll() {
    local job max=4 pids=()

    for job; do
        # Once a full batch of max jobs has been started, wait for all of them.
        (( ${#pids[@]} >= max )) && {
            wait "${pids[@]}"
            pids=()
        }

        bash -c "$job" & pids+=("$!")
    done

    wait "${pids[@]}"
}

Note that this one is easily adaptable to also check the exit code of each job as it ends, so you can warn the user if a job fails, or set an exit code for scheduleAll according to the number of jobs that failed, or something.
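
A rough sketch of that adaptation, under the assumption that "failed" simply means a non-zero exit status. The name scheduleAllChecked and the warning message are illustrative only:

```shell
# scheduleAllChecked JOB...: illustrative variant of scheduleAll that
# counts failed jobs. Each JOB is a command string run with bash -c.
scheduleAllChecked() {
    local job pid max=4 failed=0 pids=()

    for job; do
        if (( ${#pids[@]} >= max )); then
            for pid in "${pids[@]}"; do
                # `wait PID` returns the exit status of that one job
                wait "$pid" || { echo "job $pid failed" >&2; failed=$(( failed + 1 )); }
            done
            pids=()
        fi
        bash -c "$job" & pids+=("$!")
    done

    # Collect the last (possibly partial) batch
    for pid in "${pids[@]}"; do
        wait "$pid" || { echo "job $pid failed" >&2; failed=$(( failed + 1 )); }
    done

    (( failed == 0 ))   # exit status 0 only if every job succeeded
}
```

Waiting on each PID separately is what makes the status available; a single wait with several PIDs only reports the status of the last one.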

The problem with this code is just that:

  • It schedules four (in this case) jobs at a time and then waits for all four to end. Some might be done sooner than others which will cause the next batch of four jobs to wait until the longest of the previous batch is done.

A solution that takes care of this last issue would have to use kill -0 to poll whether any of the processes have disappeared instead of the wait and schedule the next job. However, that introduces a small new problem: you have a race condition between a job ending, and the kill -0 checking whether it's ended. If the job ended and another process on your system starts up at the same time, taking a random PID which happens to be that of the job that just finished, the kill -0 won't notice your job having finished and things will break again.
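
To make the kill -0 idea concrete, here is a sketch of what such a polling scheduler might look like. scheduleRolling is a made-up name, the 0.1 second polling interval is arbitrary (and fractional sleep is not guaranteed everywhere), and the PID-reuse race described above still applies.

```shell
# scheduleRolling JOB...: keep up to max jobs running, starting the next
# one as soon as a slot frees up. Polls with kill -0, so it is subject to
# the PID-reuse race: a recycled PID looks like a job that is still alive.
scheduleRolling() {
    local job pid max=4 pids=() alive

    for job; do
        while (( ${#pids[@]} >= max )); do
            alive=()
            for pid in "${pids[@]}"; do
                # kill -0 sends no signal; it only tests that the PID exists
                kill -0 "$pid" 2>/dev/null && alive+=("$pid")
            done
            pids=("${alive[@]}")
            (( ${#pids[@]} >= max )) && sleep 0.1   # all slots busy, poll again
        done
        bash -c "$job" & pids+=("$!")
    done

    wait "${pids[@]}"
}
```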

A perfect solution isn't possible in bash.

lhunath
+7  A: 

Depending on what you want to do, xargs can also help (here: converting documents with pdf2ps):

cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )

find . -name \*.pdf | xargs --max-args=1 --max-procs=$cpus  pdf2ps

From the docs:

-P max-procs, --max-procs=max-procs
    Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.

fgm
This method, in my opinion, is the most elegant solution. Except, since I'm paranoid, I always like to use `find [...] -print0` and `xargs -0`.
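
For reference, a sketch of the NUL-delimited form wrapped in a small helper. run_on_pdfs and its parameters are hypothetical names; -print0 and -0 are supported by the GNU and BSD versions of find and xargs.

```shell
# run_on_pdfs DIR NPROCS CMD [ARGS...]: illustrative helper, not from the thread.
# Runs CMD once per *.pdf file under DIR, up to NPROCS processes at a time.
run_on_pdfs() {
    local dir=$1 nprocs=$2
    shift 2
    # -print0 and -0 separate names with NUL bytes, so spaces and newlines
    # in file names cannot be mistaken for separators.
    find "$dir" -name '*.pdf' -print0 | xargs -0 -n 1 -P "$nprocs" "$@"
}
```

With the cpus variable from the answer, the original command becomes: run_on_pdfs . "$cpus" pdf2ps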
amphetamachine
+1  A: 

If you're familiar with the make command, most of the time you can express the list of commands you want to run as a makefile. For example, if you need to run $SOME_COMMAND on files *.input, each of which produces *.output, you can use the makefile

INPUT  = a.input b.input
OUTPUT = $(INPUT:.input=.output)

%.output : %.input
    $(SOME_COMMAND) $< $@

all: $(OUTPUT)

and then just run

make -j<NUMBER>

to run at most NUMBER commands in parallel.

Idelic
+2  A: 

With GNU Parallel http://www.gnu.org/software/parallel/ you can write:

some-command | parallel do-something

If you want to run one job per CPU core add -j+0:

some-command | parallel -j+0 do-something

GNU Parallel also supports running jobs on remote computers. This will run one per CPU core on the remote computers - even if they have different number of cores:

some-command | parallel -j+0 -S server1,server2 do-something

A more advanced example: here we have a list of files that we want my_script to run on. The files have an extension (maybe .jpeg). We want the output of my_script to be put next to each file in basename.out (e.g. foo.jpeg -> foo.out). We want to run one my_script per core on each computer, and we want to use the local computer, too. For the remote computers, we want the file being processed to be transferred to the given computer. When my_script finishes, we want foo.out transferred back, and we then want foo.jpeg and foo.out removed from the remote computer:

cat list_of_files | \
parallel -j+0 --trc {.}.out -S server1,server2,: \
"my_script {} > {.}.out"

GNU Parallel makes sure the output from each job does not mix, so you can use the output as input for another program:

some-command | parallel do-something | postprocess

See the video for more examples: http://www.youtube.com/watch?v=LlXDtd_pRaY

Ole Tange