views: 532

answers: 7
I often find myself writing simple for loops to perform an operation on many files, for example:

for i in `find . | grep ".xml$"`; do bzip2 $i; done

It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?

EDIT: To add a bit more context to my problem, sorry I was not clearer to start with!

I often want to run simple(ish) scripts, such as plotting a graph, compressing or uncompressing, or running some program, on reasonably sized datasets (usually between 100 and 10,000 files). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.

For example, just now I am running:

for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done

So my problems are in no way bzip specific! (Although parallel bzip does look cool, I intend to use it in future).

+1  A: 

I think you could do the following:

for i in `find . | grep ".xml$"`; do bzip2 $i&; done

But that would instantly spin off as many processes as you have files, which isn't as optimal as just running four processes at a time.
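One simple way to cap that without any extra tools (just a rough sketch, assuming plain batching in groups of four is good enough) is to wait after every few background jobs:

# Sketch: run at most 4 bzip2 jobs at a time, waiting for each
# batch to finish before starting the next one.
n=0
for i in `find . | grep ".xml$"`; do
  bzip2 "$i" &
  n=$((n+1))
  if [ $((n % 4)) -eq 0 ]; then wait; fi
done
wait   # wait for the final, possibly partial, batch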

Tom Ritter
This would be OK for small jobs, but I was running the above command on about 5,000 files. I suspect that would kill my computer stone dead! :)
Chris Jefferson
It would drown other processes, but the Linux scheduler is pretty good at making sure processes don't get completely starved. The issue here is memory usage, since paging will really kill performance.
sep332
I personally like this answer because it works without any extra tools being installed. It would work well in a situation where you are launching fewer jobs at once.
Tom Leys
+6  A: 

This Perl program fits your needs fairly well; you would just do this:

runN -n 4 bzip2 `find . | grep ".xml$"`
Peter Crabtree
Oftentimes running more than 4 can increase performance if you have 4 processors. The fifth and higher jobs can jump in when one of the others is waiting for I/O.
sep332
Good point—on the other hand, the four processes *competing* for I/O and cache lines can sometimes slow down the total process.
Peter Crabtree
+2  A: 

The answer to the general question is difficult, because it depends on the details of the things you are parallelizing. On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed, or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
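For example (a sketch; pbzip2's -p option selects the number of processors, and the loop just mirrors the one from the question):

for i in `find . | grep ".xml$"`; do pbzip2 -p4 "$i"; done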

Davide
+2  A: 

I find this kind of operation counterproductive. The more processes access the disk at the same time, the longer the read/write times become, so the final result takes longer overall. The bottleneck here won't be the CPU, no matter how many cores you have.

Have you ever copied two big files at the same time on the same hard drive? It is usually faster to copy one and then the other.

I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring CPU load first before going down the "challenging" path we technicians tend to choose much more often than needed.
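As a rough first check (a sketch; sample.xml is just a placeholder file), comparing the "user" and "real" times reported by time gives an idea of whether the job is CPU-bound or disk-bound:

time bzip2 -k sample.xml
# If user time is close to real time, the work is CPU-bound and extra
# cores should help; if real time is much larger, the disk is the
# bottleneck and more parallel processes may not gain much.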

Fernando Miguélez
Using the 'runN' script above, if I run 3 copies, I get a 2x speed-up (at 4 copies, it starts to slow down again), so it seems it is worth doing :)
Chris Jefferson
Ok, so this time the "challenging" path really pays off
Fernando Miguélez
Some systems deal with concurrent disk accesses better (LOTS better!) than others. http://stackoverflow.com/questions/9191/how-to-obtain-good-concurrent-read-performance-from-disk
timday
+4  A: 

GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile:

# Build each .xml.bz2 from the matching .xml (recipe body assumed;
# -k keeps the original file so make can still see the prerequisite).
%.xml.bz2 : %.xml
	bzip2 -k $<

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))

then do a

nice make -j 5

Replace '5' with some number, probably 1 more than the number of CPUs. You might want to use 'nice' just in case someone else wants to use the machine while you are on it.
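If you would rather derive that number than hard-code it (a sketch reusing the /proc/cpuinfo trick that appears in another answer here):

# one more make job than the number of CPU cores listed by the kernel
nice make -j $(( $(grep -c processor /proc/cpuinfo) + 1 ))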

David Nehme
I was gonna suggest using make. But you beat me to it =)
gnud
@gnud, I am interested in how you would write the makefile (if it's different from this).
David Nehme
+2  A: 

I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes, though:

#!/bin/bash

# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.

set -m

# one job slot per CPU core reported by the kernel
nodes=`grep processor /proc/cpuinfo | wc -l`
# job[] holds the PID running in each slot (0 = free)
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

# isin VALUE LIST...: return 0 if VALUE appears in LIST
isin()
{
  local v=$1

  shift 1
  while (( $# > 0 ))
  do
    if [ $v = $1 ]; then return 0; fi
    shift 1
  done
  return 1
}

# dowait: block until a job slot frees up, then mark finished slots as 0 (free)
dowait()
{
  while true
  do
    nj=( $(jobs -p) )
    if (( ${#nj[@]} < nodes ))
    then
      for (( o=0; o<nodes; o++ ))
      do
        if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
      done
      return;
    fi
    sleep 1
  done
}

let x=0
while (( x < NNN ))
do
  for (( o=0; o<nodes; o++ ))
  do
    if (( job[o] == 0 )); then break; fi
  done

  if (( o == nodes )); then
    dowait;
    continue;
  fi

  CMD &               # launch the command in the background
  let job[o]=$!       # record its PID in the free slot

  let x++
done

wait
Steve Baker
+9  A: 

Solution: Use xargs to run in parallel (don't forget the -n option!)

find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
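The same approach should also cover the find_graph example from the question's edit; a sketch using GNU xargs's -I option to substitute each file name into the command (with -I, xargs runs one command per input item):

find . -name '*.xml.bz2' -print0 | xargs -0 -P 3 -I{} find_graph -build_graph {}.graph {}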
Johannes Schaub - litb
I'll give you the tick, as I'm sure this is the best answer :)
Chris Jefferson