I'm writing a tiny script that calls the "PNGOUT" util on a few hundred PNG files. I simply did this:

find $BASEDIR -iname "*png" -exec pngout {} \;

And then I looked at my CPU monitor and noticed only one of the cores was being used, which is quite sad.

In this day and age of dual-, quad-, hexa- and octo-core desktops, how do I simply parallelize this task with Bash? (It's not the first time I've had such a need, since quite a lot of these utils are single-threaded... I already ran into this with MP3 encoders.)

Would simply running all the pngout instances in the background do? What would my find command look like then? (I'm not too sure how to mix find and the '&' character.)

If I have three hundred pictures, this would mean swapping between three hundred processes, which doesn't seem great anyway!?

Or should I copy my three hundred or so files into "nb dirs" directories, where "nb dirs" is the number of cores, then run "nb finds" concurrently? (which would be close enough)

But how would I do this?

+2  A: 

Parallelization is rarely trivial. In your case, if you can select files uniquely in equal-sized sets, then you can run multiple copies of your find script. You don't want to fire up 300 background processes at once; for jobs like this it is usually faster to run them sequentially. Backgrounding the command or using batch are both viable options.
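
For instance, a minimal sketch of the batch route (assuming the at/batch daemon is installed; batch starts each queued job only when the load average is below its threshold, so it throttles rather than truly parallelizes):

find "$BASEDIR" -iname "*png" | while IFS= read -r f; do
  printf 'pngout %q\n' "$f" | batch   # queue one job per file
done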

Assuming the files are consecutively numbered, you could use a find pattern like "[0-4].png" for one find and "[5-9].png" for another. This would keep two cores busy for roughly the same amount of time.
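
A rough sketch of that two-way split (assuming every filename ends in a digit before the extension):

find "$BASEDIR" -iname "*[0-4].png" -exec pngout {} \; &   # first half on one core
find "$BASEDIR" -iname "*[5-9].png" -exec pngout {} \; &   # second half on another
wait   # block until both halves are done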

Farming tasks out would involve a dispatcher-runner setup; building, testing, and running that would take quite a while.

Fire up BOINC to use those spare processors. You will likely want to ignore niced processes when monitoring CPU frequency. Add code like this to rc.local:

# make the ondemand governor ignore nice'd load (e.g. BOINC) on every core
for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
    echo 1 > "${CPU}/cpufreq/ondemand/ignore_nice_load"
done
BillThor
@BillThor: damn, I answered my own question at the same time you posted your answer. Interesting answer there. However look at what I found... It turns out *xargs* can be used to trivially parallelize such tasks :)
NoozNooz42
+2  A: 

to spawn all tasks in the background:

find "$BASEDIR" -iname "*png" | while IFS= read -r f; do
  pngout "$f" &
done

but of course that isn't the best option. to do 'n' tasks at a time:

NTASKS=4   # number of concurrent jobs to allow
find "$BASEDIR" -iname "*png" | {
  i=0
  while IFS= read -r f; do
    pngout "$f" &
    i=$((i+1))
    if [[ $i -ge $NTASKS ]]; then
      wait   # the batch is full: wait for all of it to finish
      i=0
    fi
  done
  wait       # catch the final, possibly partial batch
}

it's not optimal, since it waits for the whole group of concurrent tasks to finish before starting another group; but it should be better than nothing.
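
a rolling variant is possible too: a sketch assuming bash 4.3+, whose wait -n blocks until any single background job exits:

NTASKS=4
find "$BASEDIR" -iname "*png" -print0 | {
  running=0
  while IFS= read -r -d '' f; do
    if (( running >= NTASKS )); then
      wait -n                  # block until any one job exits
      running=$((running-1))
    fi
    pngout "$f" &
    running=$((running+1))
  done
  wait                         # wait for the last stragglers
}

this keeps NTASKS jobs busy at all times instead of working in lockstep batches.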

Javier
@Javier: +1 to you too... However I found a more elegant (I think) way to do it, using a relatively unknown feature of the *xargs* command :)
NoozNooz42
+8  A: 

Answering my own question... It turns out there's a relatively unknown feature of the xargs command that can be used to accomplish that:

find . -iname "*png" -print0 | xargs -0 --max-procs=4 -n 1 pngout

Bingo, instant 4x speedup on a quad-core machine :)
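
A variant that sizes the pool to the machine (assuming GNU xargs, whose -P is the short form of --max-procs, and coreutils' nproc):

find . -iname "*png" -print0 | xargs -0 -P "$(nproc)" -n 1 pngout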

NoozNooz42
Good catch. The advantage of open source: someone took the time to build the spawn-and-monitor code into xargs. This is relatively trivial for something like xargs. Note that it will likely crank up your CPU temp for the period that it runs. I monitor my quad core and rarely have any significant load. There are four BOINC tasks niced to the limit, so the load average is almost always slightly over 4.
BillThor
haha! i had a faint memory that xargs could do that... but it was more fun to do it in bash, even if it's not optimal. (note, use -print0 on find and -0 in xargs to avoid problems with nasty filenames)
Javier
@Javier: it's great anyway to see here different ways to do it :) I edited my own answer to reflect your *-print0* / *-0* suggestion :)
NoozNooz42