Folks, I have a problem. I am writing a script in Python which consists of several modules. Some of the modules depend on other modules, so they should be run only after the modules they depend on have run successfully. Each module derives from a base class module and overrides a list called DEPENDENCIES, which is a list of dependencies to be met before this module is run. There is one module which needs to be run before all other modules. Currently I am doing something like this.

modules_to_run.append(a)
modules_to_run.append(b)
modules_to_run.append(c)
.....
.....     
modules_to_run.append(z)


# Very simplistically just run the Analysis modules sequentially in
# an order that respects their dependencies
foundOne = True
while foundOne and len(modules_to_run) > 0:
    foundOne = False
    for module in list(modules_to_run):  # iterate over a copy; the body removes items
        if len(module.DEPENDENCIES) == 0:
            foundOne = True
            print_log("Executing module %s..." % module.__name__, log)
            try:
                module().execute()
                modules_to_run.remove(module)
                for module2 in modules_to_run:
                    try:
                        module2.DEPENDENCIES.remove(module)
                    except ValueError:
                        # module may not be in module2's DEPENDENCIES
                        pass
            except Exception as e:
                print_log("ERROR: %s did not run to completion" % module.__name__, log)
                modules_to_run.remove(module)
                print_log(e, log)

for module in modules_to_run:
    name = module.__name__
    print_log("ERROR: %s has unmet dependencies and could not be run:" % name, log)
    print_log(module.DEPENDENCIES, log)

Now I am seeing that some modules take a long time to execute and the script's run time is too long. So I want to make it multithreaded so that independent modules can run simultaneously, saving time. I want a solution where, after each iteration, I recalculate up to 'n' independent modules (where 'n' is the maximum number of threads, typically 2 to begin with), execute them in parallel, and wait for them to complete before the next iteration. I don't know much about algorithms, so I am stuck. Can you folks please help me find an algorithm which, after each iteration, finds a set of at most 'n' modules that are in no way dependent on each other?

+2  A: 

I posted a description of topological sorting recently in a question about make -j. Serendipity! From the Wikipedia article:

The canonical application of topological sorting (topological order) is in scheduling a sequence of jobs or tasks; topological sorting algorithms were first studied in the early 1960s in the context of the PERT technique for scheduling in project management (Jarnagin 1960). The jobs are represented by vertices, and there is an edge from x to y if job x must be completed before job y can be started (for example, when washing clothes, the washing machine must finish before we put the clothes to dry). Then, a topological sort gives an order in which to perform the jobs.

Rough outline:

  1. Build a dependency graph.
  2. Find n modules that have no dependencies. These can be executed now in parallel.
  3. Remove those modules from the graph.
  4. Repeat step 2 until done.
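
The outline above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names `run_in_batches`, `deps` and `run` are made up for the example, and it assumes the dependencies are available as a dict mapping each job to the set of jobs it depends on.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(deps, run, n=2):
    """Run jobs in dependency order, at most n at a time.

    deps: dict mapping each job to the set of jobs it depends on
          (hypothetical layout, chosen for this sketch).
    run:  callable invoked once per job.
    Returns the list of batches in the order they were executed.
    """
    # Step 1: build the graph (copied so the caller's data is not mutated).
    remaining = {job: set(d) for job, d in deps.items()}
    order = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        while remaining:
            # Step 2: pick up to n jobs with no unmet dependencies.
            ready = [job for job, d in remaining.items() if not d][:n]
            if not ready:
                # Nothing can run: a cycle or a dependency that never ran.
                raise ValueError("unmet or cyclic dependencies: %r" % remaining)
            # Run the batch in parallel and wait for all of it to finish.
            list(pool.map(run, ready))
            order.append(ready)
            # Step 3: remove the finished jobs from the graph.
            for job in ready:
                del remaining[job]
            for d in remaining.values():
                d.difference_update(ready)
            # Step 4: the while loop repeats step 2 until done.
    return order
```

With `deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}` and `n=2`, the batches would be `a`, then `b` and `c` together, then `d`.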

Read those links for a more detailed description.

John Kugelman
+1  A: 

From your description of the setup, you can also do it directly.

It looks like every module knows its dependencies. Then adding a predicate function to every module stating whether it can run is simple enough. A module can be run if and only if all of its prerequisite dependencies are satisfied.
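
Such a predicate might look something like this. It is a sketch under assumptions: the base class name `Module`, the method name `can_run` and the `completed` argument (the set of modules that have already run successfully) are all hypothetical; only the DEPENDENCIES list comes from the question.

```python
class Module(object):
    # Overridden by subclasses, as described in the question.
    DEPENDENCIES = []

    @classmethod
    def can_run(cls, completed):
        """Return True if and only if every dependency is in `completed`."""
        return all(dep in completed for dep in cls.DEPENDENCIES)


# Hypothetical example modules, for illustration only.
class A(Module):
    pass


class B(Module):
    DEPENDENCIES = [A]
```

Checking against a separate `completed` set means the DEPENDENCIES lists never have to be modified while the script runs.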

Top level modules have no dependencies so they can run from the start.

Basically that's a trivial implementation of a partial topological sort (you don't have to explore the whole dependency graph, just stay at the top level).

Two pitfalls to be aware of:

If your dependencies contain cycles (A depends on B, which depends on C, which depends on A), it may loop forever (it means the problem has no solution). You should detect this case and report an error.

The number of modules you can run may be less than the number of threads. That should not be an error. You have found a solution either when you have n modules available to run or when you have asked every module whether it can be run.
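
A rough sketch of a selection step that handles both pitfalls, assuming each batch is waited on before the next one is chosen (so nothing is still running when this is called). The names `next_batch`, `pending` and `completed` are hypothetical; `completed` is the set of modules that have already run successfully.

```python
def can_run(module, completed):
    """True if every entry of module.DEPENDENCIES has already completed."""
    return all(dep in completed for dep in module.DEPENDENCIES)


def next_batch(pending, completed, n=2):
    """Return at most n modules from `pending` that can run now."""
    batch = []
    for module in pending:
        if can_run(module, completed):
            batch.append(module)
            if len(batch) == n:
                break  # found n modules, no need to ask the rest
    # Fewer than n modules is fine. Zero modules while some are still
    # pending means no progress is possible: a cycle, or a dependency
    # that failed or was never scheduled.
    if pending and not batch:
        names = [m.__name__ for m in pending]
        raise RuntimeError("cyclic or unsatisfiable dependencies: %r" % names)
    return batch
```

The caller would run the returned batch, add the modules that succeeded to `completed`, remove the batch from `pending`, and call `next_batch` again until `pending` is empty.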

kriss
You're right! I can just implement a can_run() method and loop through all remaining modules to find at most "n" modules and then start worker threads to run them. Thanks for the suggestion,
kumar