What is the best/easiest way to build a minimal task queue system for Linux using bash and common tools?

I have a file with 9'000 lines; each line holds a bash command line. The commands are completely independent.

command 1 > Logs/1.log
command 2 > Logs/2.log
command 3 > Logs/3.log
...

My box has more than one core and I want to execute X tasks at the same time. I searched the web for a good way to do this. Apparently, a lot of people have this problem but nobody has a good solution so far.

It would be nice if the solution had the following features:

  • can interpret more than one command (e.g. command; command)
  • can interpret stream redirects on the lines (e.g. ls > /tmp/ls.txt)
  • only uses common Linux tools

Bonus points if it works on other Unix-clones without too exotic requirements.

A: 

Okay, after posting the question here, I found the following project which looks promising: ppss.

Edit: Not quite what I want. PPSS is focused on processing "all files in directory A".

Manuel
+4  A: 

Can you convert your command list to a Makefile? If so, you could just run "make -j X".
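For instance, one way to generate such a Makefile (a sketch: the two-line sample list and the jobN target names are made up here; the real input would be the 9'000-line file):

```shell
# Sketch: turn a command list (one command per line) into a Makefile
# with one phony target per line, then let make -j do the throttling.
printf 'echo one > 1.log\necho two > 2.log\n' > mycommands.sh   # stand-in list
{
  awk 'BEGIN { printf "all:" } { printf " job%d", NR } END { print "" }' mycommands.sh
  awk '{ printf ".PHONY: job%d\njob%d:\n\t%s\n", NR, NR, $0 }' mycommands.sh
} > Makefile
make -j 15        # run up to 15 jobs at once (here there are only 2)
```

Because make hands each recipe line to the shell, redirects and `cmd; cmd` lines in the list work unchanged.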

Gerald Combs
Perfect, this worked like a charm!
Manuel
A: 

Well, this is a kind of fun question anyway.

Here's what I'd do, assuming bash(1) of course.

  • figure out how many of these commands can usefully run concurrently. It's not going to be just the number of cores; a lot of commands will be suspended for I/O and that sort of thing. Call that number N. N=15 for example.
  • set up a trap signal handler for the SIGCHLD signal, which occurs when a child process terminates. trap signalHandler SIGCHLD
  • cat your list of commands into a pipe
  • write a loop that reads stdin and executes the commands one by one, decrementing a counter. When the counter is 0, it waits.
  • your signal handler, which runs on SIGCHLD, increments that counter.

So now, it runs the first N commands, then waits. When the first child terminates, the wait returns, it reads another line, runs a new command, and waits again.
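A runnable sketch of that scheme (assuming bash; it swaps the SIGCHLD trap for bash 4.3's wait -n, since several children exiting close together can coalesce into a single SIGCHLD and make a trap-driven counter miss jobs; the three-line sample list stands in for the real file):

```shell
#!/usr/bin/env bash
# Stand-in command list (the real one has 9'000 lines):
mkdir -p Logs
printf '%s\n' 'echo one > Logs/1.log' \
              'echo two > Logs/2.log' \
              'echo three > Logs/3.log' > mycommands.sh

N=15                                  # max simultaneous jobs
while IFS= read -r cmd; do
  # throttle: block until fewer than N children are running
  while [ "$(jobs -rp | wc -l)" -ge "$N" ]; do
    wait -n                           # returns as soon as any one child exits
  done
  eval "$cmd" &                       # eval so redirects and ';' work
done < mycommands.sh
wait                                  # let the stragglers finish
```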

Now, that version takes care of many jobs terminating close together. I suspect you can get away with a simpler one:

 N=15
 count=$N
 cat mycommands.sh |
 while read -r cmd
 do
   eval "$cmd" &
   if (( --count == 0 ))
   then
       wait            # let the whole batch finish
       count=$N        # then refill it
   fi
 done

Now, this one will start the first 15 commands, then wait for that whole batch to terminate before starting the next 15, and so on.

Charlie Martin
A: 

Similar distributed-computing fun is the Mapreduce Bash Script:

http://blog.last.fm/2009/04/06/mapreduce-bash-script

And thanks for pointing out ppss!

A: 

You can use the xargs command; its --max-procs option does what you want. For instance, Charlie Martin's solution becomes, with xargs:

tr '\012' '\000' <mycommands.sh | xargs --null -n 1 --max-procs=$X bash -c

details:

  • X is the maximum number of simultaneous processes, e.g. X=15; --max-procs is doing the magic
  • -n 1 hands one line to each bash invocation; without it xargs packs several lines into a single call, and bash -c would only run the first one
  • the first tr is here to terminate the lines with null bytes, for the xargs --null option, so that quotes, redirects, etc. are not wrongly expanded
  • bash -c runs the command

I tested it with this mycommands.sh file for instance:

date
date "+%Y-%m-%d" >"The Date".txt
wc -c <'The Date'.txt >'The Count'.txt
Colas Nahaboo
A: 

This is a specific case, but if you are trying to process a set of files and produce another set of output files, you can start as many copies of the loop as you have cores and have each one check whether an output file already exists before processing the corresponding input. The example below converts a directory of .m4b files to .mp3 files:

Just run this command as many times as you have cores:

ls *.m4b | while read -r f; do test -f "${f%m4b}mp3" || mencoder -of rawaudio "$f" -oac mp3lame -ovc copy -o "${f%m4b}mp3"; done &

A: 

GNU Parallel http://www.gnu.org/software/parallel/ is a more general tool for parallelizing than PPSS.

If runfile contains:

command 1 > Logs/1.log
command 2 > Logs/2.log
command 3 > Logs/3.log

you can do:

cat runfile | parallel -j+0

which will run one command per CPU core.

If your commands are as simple as above, you do not even need runfile; you can do:

seq 1 3 | parallel -j+0 'command {} > Logs/{}.log'

If you have more computers available to do the processing you may want to look at the --sshlogin and --trc options for GNU Parallel.

Ole Tange