Currently, I am using a graph to store the dependencies and then running all of the vertices that don't have any dependencies. This works, but it feels kludgy. Is there a better algorithm or data structure I should be using?

#!/usr/bin/perl

use strict;
use warnings;

use Graph;

#FIXME: naive implementation, there may be a much better way to do this
sub run_in_parallel {
       my $g = shift->copy;

       while (my @v = $g->vertices) {
               my @run = grep { $g->is_successorless_vertex($_) } @v;
               print "running ", join(", ", @run), " in parallel\n";
               for my $component (@run) {
                       $g->delete_vertex($component);
               }
       }
}

my $g = Graph->new;
while (<DATA>) {
       my ($component, @dependencies) = split;
       unless ($g->has_vertex($component)) {
               $g->add_vertex($component);
       }
       for my $dependency (@dependencies) {
               unless ($g->has_vertex($dependency)) {
                       $g->add_vertex($dependency);
               }
               $g->add_edge($component, $dependency);
       }
}

run_in_parallel($g);

#component  dependency list
__DATA__
a           b c d
b           e
c           f g
d
e
f
g
+1  A: 

You can run in parallel any tasks with no unfinished dependencies. For example, on the dataset shown you can run d, e, f, and g in parallel at the start. When f and g are finished, you can run c in parallel, even if d and e are still running, etc. Your algorithm just needs to mark each task as done when it finishes and then reevaluate whether any tasks have become available to run.
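Something like this (untested) sketch is all the re-evaluation step really needs, assuming you keep the tasks and their dependencies in plain hashes; runnable_tasks and the hash names are made up for illustration:

use List::Util 'all';

# Tasks that can start right now: not finished, not already running, and with
# every dependency finished. Call this again each time a task completes.
# Note that all {} on an empty list is true, so dependency-free tasks qualify.
sub runnable_tasks {
    my ($deps, $done, $running) = @_;   # task => [deps], plus two "sets"
    return grep {
        my $task = $_;
        !$done->{$task} && !$running->{$task}
            && all { $done->{$_} } @{ $deps->{$task} || [] }
    } keys %$deps;
}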

Karl Bielefeldt
Yeah, I had thought about that. It could be achieved by running each of the tasks, removing it from the graph when it is done, and re-evaluating the graph. I was holding off on that in the hopes that there was already a well-known algorithm for this sort of thing.
Chas. Owens
+1  A: 

I don't think it's kludgy, except that the syntax is a little verbose. You can fix that with wrapper functions.

You support dependencies being out of order, but you don't support cycles yet.

It very cleanly describes what you need to do, in a way that makes it very easy to check that it's being done correctly.

I might also create the reverse graph and use that for traversal, e.g. with Graph::Traversal::BFS.

Perhaps I wouldn't use Graph but represent the graphs as a hash of hashrefs, unless the graph will be used for other purposes (e.g. printed as a diagram, or analyzed or rewritten with graph algorithms).
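For the question's data, that hash-of-hashrefs version might look something like this (names are only illustrative):

# Each component maps to a hashref of its direct dependencies.
my %deps = (
    a => { b => 1, c => 1, d => 1 },
    b => { e => 1 },
    c => { f => 1, g => 1 },
    d => {}, e => {}, f => {}, g => {},
);

# "No remaining dependencies" is then just an empty hashref...
my @runnable = grep { !%{ $deps{$_} } } keys %deps;

# ...and finishing a component (say f) means deleting it everywhere.
delete $deps{$_}{f} for keys %deps;
delete $deps{f};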

You may want to add cycle detection.
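If I'm remembering Graph's API correctly, it already ships with cycle detection (has_a_cycle / find_a_cycle), so a guard after building the graph could be as simple as:

# After the graph has been built:
if ($g->has_a_cycle) {
    die "dependency cycle detected: ", join(" -> ", $g->find_a_cycle), "\n";
}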

reinierpost
I know there are no cycles (they wouldn't make sense; one thing has to run first). It feels kludgy because it seems like there should be a better way than looking at every vertex left in the graph.
Chas. Owens
+2  A: 

Another idea is to use a Petri net. The nodes in your graph are its places, and the actions are its transitions. Every place should have exactly one enabling transition, even when it has no dependencies; that way you don't even need to place any initial tokens.

This also incorporates Karl Bielefeldt's suggestion.
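A rough, untested sketch of that encoding for the question's data: one place and one transition per component, with the transition's input places being the component's dependencies. I'm treating the dependency arcs as read arcs (tokens are checked but not consumed) so a shared dependency can enable more than one transition, which simplifies a plain Petri net a little:

use strict;
use warnings;
use List::Util 'all';

my %inputs = (                      # transition => input places (dependencies)
    a => [qw(b c d)], b => ['e'], c => [qw(f g)],
    d => [], e => [], f => [], g => [],
);
my %token;                          # a place gets a token once its component is done
my %fired;                          # transitions that have already run

while (keys %fired < keys %inputs) {
    my @enabled = grep {
        my $t = $_;
        !$fired{$t} && all { $token{$_} } @{ $inputs{$t} }
    } keys %inputs;
    die "no transition enabled -- cycle?\n" unless @enabled;

    print "running ", join(", ", @enabled), " in parallel\n";
    for my $t (@enabled) {
        $fired{$t} = 1;             # the action has run...
        $token{$t} = 1;             # ...so its own place now holds a token
    }
}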

reinierpost
A: 

Meh, I don't know if I'd bother with a full-on graph representation here; instead, use either a list or a hash to store the jobs (depending on whether you need random access, or if you will sort by priority or some other constraint), and store its direct dependencies in a sub-field.

It doesn't really buy you much to calculate up front which jobs can be run in parallel, since that list is going to be constantly changing as jobs are completed. Instead, just start up whatever you can, immediately, and when you have available resources, look dynamically for the next best candidate.

# pseudocode:
use List::MoreUtils 'any';

while (my $job = get_next_job_that_I_would_ideally_run_now()) {
    # any { not } is subtly different from not all {} -- see how empty lists
    # are handled!
    next if any { not $_->finished() } $job->get_dependent_jobs();

    # Job has no dependencies, or they are all complete: safe to run.
    # This probably involves forking or spawning a subthread to perform the
    # execution, and when it is done, the finished() state will be set so the
    # jobs that depend on it can then run.
    $job->execute;

    # If no more resources are available for starting more jobs, wait here
    # until something is freed up; then we can continue with our loop and look
    # for the next best job to start.
}
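For the forking part mentioned in the comments, one (untested) possibility is Parallel::ForkManager, whose run_on_finish callback runs in the parent when a child is reaped; that's where the parent-side finished state has to be recorded, since a child process can't update the parent's data directly. run_component and the hard-coded job names are stand-ins:

use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical worker; stands in for whatever $job->execute really does.
sub run_component { my ($c) = @_; print "child $$ running $c\n"; sleep 1 }

my %finished;
my $pm = Parallel::ForkManager->new(4);          # at most 4 children at a time

$pm->run_on_finish(sub {
    my ($pid, $exit_code, $component) = @_;      # third arg is what start() was given
    $finished{$component} = 1 if $exit_code == 0;
});

for my $component (qw(d e f g)) {                # whatever is runnable right now
    $pm->start($component) and next;             # parent gets the pid and moves on
    run_component($component);                   # child does the work
    $pm->finish(0);                              # child exits, reporting success
}
$pm->wait_all_children;                          # reap everyone, fire the callbacks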
Ether
A Petri net engine will do this for you.
reinierpost
+1  A: 

It just happens that I was thinking about this recently while answering someone else's question.

Your algorithm in general looks right, although I'm not sure if there's a more efficient way to go about it. One thing, though: you shouldn't precompute the way you are in your example code. That runs in lockstep; if A, B, and C are runnable at the start, it will run all three and wait for all three to complete before looking further down the list. But what if D only depends on A, and A is the first job to finish, before B and C? You'd be wasting capacity not to start D right away, so you actually want to keep the graph as you run jobs and re-poll the runnable jobs each time a job finishes, unless you already have a full queue.
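To make that re-polling concrete, here's an untested sketch that sticks with the question's Graph object and plain fork/wait: vertices are only deleted when their job actually finishes, and the runnable set is recomputed after every finish. run_component and $max_workers are made-up stand-ins, and exit statuses are ignored for brevity:

use strict;
use warnings;

use Graph;

# Hypothetical worker; replace with whatever actually runs a component.
sub run_component { print "child $$ running $_[0]\n"; sleep 1 + int rand 3 }

sub run_in_parallel {
    my ($graph, $max_workers) = @_;
    my $g = $graph->copy;

    my %running;                                     # pid => component in flight

    while ($g->vertices || %running) {
        # Re-poll: anything successorless and not already in flight can start.
        my %in_flight = reverse %running;            # component => pid
        my @runnable  = grep { !$in_flight{$_} }
                        grep { $g->is_successorless_vertex($_) } $g->vertices;

        for my $component (@runnable) {
            last if keys %running >= $max_workers;   # the queue is full
            defined(my $pid = fork) or die "fork failed: $!";
            if ($pid == 0) {                         # child
                run_component($component);
                exit 0;
            }
            $running{$pid} = $component;             # parent
        }

        die "nothing runnable and nothing running -- is there a cycle?\n"
            unless %running;

        # Block until any one child finishes, then free its vertex so its
        # predecessors can become runnable on the next pass.
        my $pid  = wait;
        my $done = delete $running{$pid};
        print "finished $done\n";
        $g->delete_vertex($done);
    }
}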

I actually started writing up an answer using Algorithm::Dependency::Ordered that's very much in the spirit of Ether's answer. It's simple to work with and lighter on memory than Graph, but I just realized that it suffers in part from the same under-utilization problem. I think it might be reparable, though, in which case I'll update here when I can. :)

hobbs
A: 

It seems to me that you could get rid of most of the vertex grepping by keeping track of end nodes when you add them to the graph, and as you process your node list.

if (@dependencies) {
    for my $dependency (@dependencies) {
        unless ($g->has_vertex($dependency)) {
            $g->add_vertex($dependency);
        }
        $g->add_edge($component, $dependency);
    }
}
else {
    push @end_components, $component;
}

Then when you process your data set, start one thread of execution (however you implement your parallelism) for each end node. When a thread completes, any parent nodes left without other successors are added to the end node list. Keep processing the end node list until both the graph and the node list are empty.

sub run_in_parallel {
    my $g            = shift->copy;
    my $end_vertices = shift;        # e.g. \@end_components collected above

    # Seed the queue with the end nodes found while building the graph.
    my @queue = @$end_vertices;

    while ($g->vertices) {

        print "running ", join(", ", @queue), " in parallel\n";

        # Take this batch off the queue; newly runnable parents go on for the
        # next pass.
        for my $component (splice @queue) {
            # When this process finishes:
            my @parents = $g->predecessors($component);
            $g->delete_vertex($component);
            push @queue, grep { $g->is_successorless_vertex($_) } @parents;
        }

        # Add some error check for non-empty graph + empty queue + no
        # processes waiting to finish.
    }
}

This change adds a bit of record keeping, and is a tad more fragile than your initial submission. Consider encapsulating these behaviors in an object that either contains or inherits from Graph.

The biggest performance benefit will be with large, deep dependency graphs. Minimal improvement will be seen with small graphs (where the cost of grepping is tiny compared to the cost of managing processes), or where the graphs are very shallow.

This code is thoroughly untested.

daotoad