I am working on some code that takes a dataset and runs some algorithms on it.

User uploads a dataset, and then selects which algorithms will be run on this dataset and creates a workflow like this:

workflow = 
{0: {'dataset': 'some dataset'},
 1: {'algorithm1': "parameters"},
 2: {'algorithm2': "parameters"},
 3: {'algorithm3': "parameters"}
}

Which means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I will take its results and run algorithm2 on those results as my new dataset. Then I will take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for this workflow.

I am writing this in Python. Can you suggest some strategies about processing this workflow?

+2  A: 

The way you want to do it seems sound to me; otherwise you need to post more information about what you are trying to accomplish.

One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:

workflow = [ ('dataset', 'some dataset'),
             ('algorithm1', "parameters"),
             ('algorithm2', "parameters"),
             ('algorithm3', "parameters")]
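A loop over such a list might look like this (the `ALGORITHMS` dict and the toy algorithms are illustrative stand-ins, not part of the answer):

```python
# Hypothetical name-to-function registry; stand-ins for the real algorithms.
ALGORITHMS = {
    'algorithm1': lambda data, params: [x + 1 for x in data],
    'algorithm2': lambda data, params: [x * 2 for x in data],
}

def process(workflow):
    # The first tuple holds the dataset; each later tuple names an algorithm
    # and its parameters. Each step's output feeds the next step.
    _, data = workflow[0]
    for name, params in workflow[1:]:
        data = ALGORITHMS[name](data, params)
    return data

workflow = [('dataset', [1, 2, 3]),
            ('algorithm1', None),
            ('algorithm2', None)]
# process(workflow) -> [4, 6, 8]
```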
fabrizioM
+3  A: 

If each algorithm works on each element of the dataset, map() would be an elegant option:

dataset=workflow[0]
for algorithm in workflow[1:]:
    dataset=map(algorithm, dataset)

e.g. for the squares of odd numbers only, use:

>>> algo1=lambda x:0 if x%2==0 else x
>>> algo2=lambda x:x*x
>>> dataset=range(10)
>>> workflow=(dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset=map(algo, dataset)
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
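(Note the session above is Python 2, where map() returns a list. In Python 3 map() returns a lazy iterator, so you'd wrap the final result in list() to materialize it:)

```python
algo1 = lambda x: 0 if x % 2 == 0 else x   # zero out the even numbers
algo2 = lambda x: x * x                    # square everything
dataset = range(10)
for algo in (algo1, algo2):
    dataset = map(algo, dataset)           # lazy in Python 3
result = list(dataset)                     # force evaluation
# result == [0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
```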
Adam Matan
Maybe less concise and elegant than the highest-scored answer, but I would tend to prefer this for readability and ease of reuse by other users.
Morlock
+1  A: 

Define a Dataset class that tracks... data... for your set. Define methods in this class. Something like this:

class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on the algorithm
        pass

    def algorithm2(self, param1, param2):
        # More updating/processing
        pass

Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.

myDataset = Dataset() # Whatever actual construction you need to do

For each subsequent entry...

  • Extract the key/value somehow (I'd recommend changing your workflow data structure if possible, dict is inconvenient here)
  • Parse the param string to a tuple of arguments (this step is up to you).
  • Assuming you now have the string algorithm and the tuple params for the current iteration...

    getattr(myDataset, algorithm)(*params)

  • This will call the function on myDataset with the name specified by "algorithm" with the argument list contained in "params".
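Put together, the dispatch step might look like this (the Dataset class and its method are illustrative stand-ins):

```python
class Dataset:
    def __init__(self, data):
        self.data = data

    def algorithm1(self, factor):
        # Illustrative algorithm: scale every element in place.
        self.data = [x * factor for x in self.data]

myDataset = Dataset([1, 2, 3])
algorithm, params = 'algorithm1', (10,)
# getattr looks the method up by name, so the string from the
# workflow drives which method gets called.
getattr(myDataset, algorithm)(*params)   # same as myDataset.algorithm1(10)
# myDataset.data == [10, 20, 30]
```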

Sapph
+6  A: 

You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:

result = reduce(lambda data, (aname, p): algo_by_name(aname)(p, data), workflow)

This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):

workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]

And that your algorithms look like:

def algo0(p, data):
    …
    return output_data.filename

algo_by_name takes a name and gives you an algo function; for example:

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1, }[name]
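A runnable sketch of the whole pipeline (note that Python 3 dropped tuple parameters in lambdas, so the unpacking moves into the lambda body; the algorithms here are toy stand-ins):

```python
from functools import reduce

def algo0(p, data):
    # Toy stand-in: add a constant to every element.
    return [x + p.get('offset', 0) for x in data]

def algo1(p, data):
    # Toy stand-in: scale every element.
    return [x * p.get('factor', 1) for x in data]

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1}[name]

workflow = [[1, 2, 3],
            ('algo0', {'offset': 1}),
            ('algo1', {'factor': 2})]

# With no initializer, reduce uses workflow[0] (the dataset) as the
# starting accumulator and folds each (name, params) step into it.
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data),
                workflow)
# result == [4, 6, 8]
```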

(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)

Tobu
Excellent and elegant.
Adam Matan
Sorry for being such a newbie, but I couldn't make the code work. Should I return algorithm names from the algo_by_name function?
Stephen T.
algo_by_name(aname) needs to be a function, so you can pass (p, data) to it. I wrote an example.
Tobu
Thanks for the example, I appreciate it.
Stephen T.
+1  A: 

Here is how I would do this (all code untested):

Step 1: You need to create the algorithms. The Dataset could look like this:

class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x

Notice that you make an iterator out of it, so you iterate over it one item at a time. There's a reason for that, as you'll see later.

Another algorithm could look like this:

class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier

Step 2

Your user would then need to chain these together somehow. If he had access to Python directly, you can just do this:

dataset = Dataset(range(100))
multiplier = Multiplier(dataset, 5)

and then get the results by:

for x in multiplier:
    print x

And it would ask the multiplier for one piece of data at a time, and the multiplier would in turn ask the dataset. If you have a chain, this means that one piece of data is handled at a time. This means you can handle huge amounts of data without using a lot of memory.

Step 3

Probably you want to specify the steps in some other way, for example a text file or a string (it sounds like this may be web-based?). Then you need a registry of the algorithms. The easiest way is to just create a module called "registry.py" like this:

algorithms = {}

Easy, eh? You would register a new algorithm like so:

from registry import algorithms
algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier

You'd also need a method that creates the chain from specifications in a text file or something. I'll leave that up to you. ;)
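One possible sketch of such a builder, assuming the spec is a simple list of (name, args) pairs (that spec format is my assumption, not part of the answer):

```python
algorithms = {}  # the registry module's dict

class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x

class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier

algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier

def build_chain(spec):
    # The first step is constructed directly; each later step wraps
    # the previous one, so iteration stays lazy all the way down.
    name, args = spec[0]
    step = algorithms[name](*args)
    for name, args in spec[1:]:
        step = algorithms[name](step, *args)
    return step

chain = build_chain([('dataset', ([1, 2, 3],)),
                     ('multiplier', (5,))])
# list(chain) == [5, 10, 15]
```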

(I would probably use the Zope Component Architecture and make algorithms components and register them in the component registry. But that is all strictly speaking overkill).

Lennart Regebro