views: 532

answers: 7
I often find myself writing simple for loops to perform an operation on many files, for example:

for i in `find . | grep ".xml$"`; do bzip2 $i; done

It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?

EDIT: To add a bit more context to my problem, sorry I was not clearer to start with!

I often want to run simple(ish) scripts, such as plotting a graph, compressing or uncompressing, or running some program, on reasonably sized datasets (usually between 100 and 10,000 files). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.

For example, just now I am running:

for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done

So my problems are in no way bzip specific! (Although parallel bzip does look cool, I intend to use it in future).

+1  A: 

I think you could do the following:

for i in `find . | grep ".xml$"`; do bzip2 $i&; done

But that would instantly spin off as many processes as you have files, which isn't as optimal as just running four processes at a time.
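One simple way to cap that without any extra tools (just a rough sketch, assuming plain batching in groups of four is good enough) is to wait after every few background jobs:

# Sketch: run at most 4 bzip2 jobs at a time, waiting for each
# batch to finish before starting the next one.
n=0
for i in `find . | grep ".xml$"`; do
  bzip2 "$i" &
  n=$((n+1))
  if [ $((n % 4)) -eq 0 ]; then wait; fi
done
wait   # wait for the final, possibly partial, batch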

Tom Ritter
This would be OK for small jobs, but I was running the above command on about 5,000 files. I suspect that would kill my computer stone dead! :)
Chris Jefferson
It would drown other processes, but the Linux scheduler is pretty good at making sure processes don't get completely starved. The issue here is memory usage, since paging will really kill performance.
sep332
I personally like this answer because it works without any extra tools being installed. It would work well in a situation where you are launching fewer jobs at once.
Tom Leys
+6  A: 

This Perl program fits your needs fairly well; you would just do this:

runN -n 4 bzip2 `find . | grep ".xml$"`
Peter Crabtree
Oftentimes running more than 4 can increase performance if you have 4 processors. The fifth and higher jobs can jump in when one of the others is waiting for I/O.
sep332
Good point—on the other hand, the four processes *competing* for I/O and cache lines can sometimes slow down the total process.
Peter Crabtree
+2  A: 

The answer to the general question is difficult, because it depends on the details of the things you are parallelizing. On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed, or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
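For example (a sketch; pbzip2's -p option selects the number of processors, and the loop just mirrors the one from the question):

for i in `find . | grep ".xml$"`; do pbzip2 -p4 "$i"; done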

Davide
+2  A: 

I find this kind of operation counterproductive. The more processes access the disk at the same time, the longer the read/write times become, so the final result takes longer overall. The bottleneck here won't be the CPU, no matter how many cores you have.

Have you ever copied two big files at the same time on the same hard drive? It is usually faster to copy one and then the other.

I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring CPU load first before going down the "challenging" path we technicians tend to choose much more often than needed.
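As a rough first check (a sketch; sample.xml is just a placeholder file), comparing the "user" and "real" times reported by time gives an idea of whether the job is CPU-bound or disk-bound:

time bzip2 -k sample.xml
# If user time is close to real time, the work is CPU-bound and extra
# cores should help; if real time is much larger, the disk is the
# bottleneck and more parallel processes may not gain much.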

Fernando Miguélez
Using the 'runN' script above, if I run 3 copies, I get a 2x speed-up (at 4 copies, it starts to slow down again), so it seems it is worth doing :)
Chris Jefferson
Ok, so this time the "challenging" path really pays off
Fernando Miguélez
Some systems deal with concurrent disk accesses better (LOTS better!) than others. http://stackoverflow.com/questions/9191/how-to-obtain-good-concurrent-read-performance-from-disk
timday
+4  A: 

GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile:

# Build each .xml.bz2 from the matching .xml (recipe body assumed;
# -k keeps the original file so make can still see the prerequisite).
%.xml.bz2 : %.xml
	bzip2 -k $<

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))

then do a

nice make -j 5

Replace '5' with some number, probably 1 more than the number of CPUs. You might want to use 'nice' just in case someone else wants to use the machine while you are on it.
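If you would rather derive that number than hard-code it (a sketch reusing the /proc/cpuinfo trick that appears in another answer here):

# one more make job than the number of CPU cores listed by the kernel
nice make -j $(( $(grep -c processor /proc/cpuinfo) + 1 ))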

David Nehme
I was gonna suggest using make. But you beat me to it =)
gnud
@gnud, I am interested in how you would write the makefile (if it's different from this).
David Nehme
+2  A: 

I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes, though:

#!/bin/bash

# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.

set -m

# one job slot per CPU core reported by the kernel
nodes=`grep processor /proc/cpuinfo | wc -l`
# job[] holds the PID running in each slot (0 = free)
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

# isin VALUE LIST...: return 0 if VALUE appears in LIST
isin()
{
  local v=$1

  shift 1
  while (( $# > 0 ))
  do
    if [ $v = $1 ]; then return 0; fi
    shift 1
  done
  return 1
}

# dowait: block until a job slot frees up, then mark finished slots as 0 (free)
dowait()
{
  while true
  do
    nj=( $(jobs -p) )
    if (( ${#nj[@]} < nodes ))
    then
      for (( o=0; o<nodes; o++ ))
      do
        if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
      done
      return;
    fi
    sleep 1
  done
}

let x=0
while (( x < NNN ))
do
  for (( o=0; o<nodes; o++ ))
  do
    if (( job[o] == 0 )); then break; fi
  done

  if (( o == nodes )); then
    dowait;
    continue;
  fi

  CMD &               # launch the command in the background
  let job[o]=$!       # record its PID in the free slot

  let x++
done

wait
Steve Baker
+9  A: 

Solution: Use xargs to run in parallel (don't forget the -n option!)

find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
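The same approach should also cover the find_graph example from the question's edit; a sketch using GNU xargs's -I option to substitute each file name into the command (with -I, xargs runs one command per input item):

find . -name '*.xml.bz2' -print0 | xargs -0 -P 3 -I{} find_graph -build_graph {}.graph {}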
Johannes Schaub - litb
I'll give you the tick, as I'm sure this is the best answer :)
Chris Jefferson