I'm writing a Perl script to run a pipeline of sorts. I start by reading a JSON file with a bunch of parameters in it. I then do some work, mainly building some data structures needed later and calling external programs that generate output files I keep references to.

I usually use a subroutine for each of these steps. Each such subroutine usually writes its data to a unique place that no other subroutine writes to (e.g. a specific key in a hash) and reads data that other subroutines may have generated.

These steps take a good couple of minutes if done sequentially, but most of them can run in parallel with some simple dependency logic that I know how to handle (using threads and a queue). So I wonder how to implement this in a way that allows sharing data between the threads. What framework would you suggest? Perhaps an object (of which I would have only one instance) that keeps all the shared data in $self? Perhaps a simple script (no objects) with some "global" shared variables? ...

I would obviously prefer a simple, neat solution.

+1  A: 

You can certainly do that in Perl. I suggest you look at perldoc threads and perldoc threads::shared, as those manual pages best describe the methods and pitfalls you'll encounter when using threads in Perl.
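For illustration, here's a minimal sketch of that API; the backtick commands and hash keys are placeholders, not anything from your actual pipeline:

use strict;
use warnings;
use threads;
use threads::shared;

my %results :shared;                    # one shared hash for all step outputs

my $t1 = threads->create(sub {
    my $out = `external_program_a`;     # placeholder command
    lock(%results);
    $results{step_a} = $out;            # each step writes only its own key
});
my $t2 = threads->create(sub {
    my $out = `external_program_b`;     # placeholder command
    lock(%results);
    $results{step_b} = $out;
});

$_->join for $t1, $t2;                  # wait for both steps to finish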

What I would really suggest, provided you can use it, is instead a queue management system such as Gearman, which has various interfaces including a Perl module. It allows you to create as many "workers" as you want (the subs actually doing the work) and one simple "client" that schedules the appropriate tasks and then collates the results, without resorting to tricks such as hash keys specific to each task.

This approach would also scale better, and you could have clients and workers (even managers) on different machines, should you so choose.

Other queue systems, such as TheSchwartz, are less suitable here, as they lack the feedback/result channel that Gearman provides. For all practical purposes, using Gearman this way works much like the threaded system you described, just without the hassles and headaches that any system based on threads may eventually suffer from: locking variables, using semaphores, joining threads.
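To give a rough idea, here is a sketch of the two sides; the function name run_step, the payload, and the gearmand address are made up for illustration:

# worker.pl -- registers a function and waits for jobs
use Gearman::Worker;

my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');          # assumed local gearmand
$worker->register_function(run_step => sub {
    my $job   = shift;
    my $input = $job->arg;                       # whatever the client sent
    # ... run the external program for this step using $input ...
    return "result for: $input";                 # handed back to the client
});
$worker->work while 1;

# client.pl -- schedules a task and collects its result
use Gearman::Client;

my $client = Gearman::Client->new;
$client->job_servers('127.0.0.1:4730');
my $result_ref = $client->do_task(run_step => 'step-parameters');
print "got: $$result_ref\n";

do_task blocks until the worker answers; to run several steps in parallel you would use new_task_set/add_task/wait and collate results in on_complete callbacks.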

mfontani
Thank you. What I'm missing is: how do you suggest sharing the information between the threads?
David B
David, check out http://search.cpan.org/~bradfitz/Gearman/lib/Gearman/Worker.pm
Octoberdan
David, depending on the approach: with threads, use threads::shared's shared variables. With Gearman, rather than sharing variables you may want to pass the critical data to the worker, as in "do operation X with this, this and this other piece of data". If you need different subs to handle the same piece of data, have them return the munged data.
mfontani
+2  A: 

Read threads::shared. By default, as you perhaps know, Perl variables are not shared between threads. But if you place the shared attribute on them, they are:

my %repository: shared;

Then if you want to synchronize access to it, the easiest way is:

{   lock( %repository );
    $repository{JSON_dump} = $json_dump;
}
# %repository will be unlocked at the end of scope.

However, you could use Thread::Queue, which is supposed to be fuss-free, and do this as well:

$repo_queue->enqueue( JSON_dump => $json_dump );

Then your consumer thread could just:

my ( $key, $value ) = $repo_queue->dequeue( 2 );
$repository{ $key } = $value;
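Put together, a minimal sketch of the whole pattern (the step names and backtick commands are placeholders):

use strict;
use warnings;
use threads;
use Thread::Queue;

my $repo_queue = Thread::Queue->new;

# each worker thread runs one pipeline step and queues its (key, value) pair
my @workers = map {
    my $step = $_;
    threads->create(sub {
        my $output = `run_$step`;                # placeholder external command
        $repo_queue->enqueue( $step => $output );
    });
} qw(align sort index);

# the main thread collates; each worker enqueues exactly one pair
my %repository;
for ( 1 .. @workers ) {
    my ( $key, $value ) = $repo_queue->dequeue( 2 );
    $repository{$key} = $value;
}
$_->join for @workers;

Note that this way only the main thread ever touches %repository, so it needs no shared attribute and no locking; the queue is the only shared structure.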
Axeman
+1 thanks Axeman. Is it necessary to lock the entire repository when only a part of it (e.g. `repository->{key}`) is changed?
David B
@David B, yes, unfortunately, it is. Refer to http://search.cpan.org/perldoc?threads::shared#lock_VARIABLE
Axeman