I am working on some code that takes a dataset and runs some algorithms on it.

User uploads a dataset, and then selects which algorithms will be run on this dataset and creates a workflow like this:

workflow = 
{0: {'dataset': 'some dataset'},
 1: {'algorithm1': "parameters"},
 2: {'algorithm2': "parameters"},
 3: {'algorithm3': "parameters"}
}

Which means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I will take its results and run algorithm2 on those results as my new dataset. Then I will take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for this workflow.

I am writing this in Python. Can you suggest some strategies about processing this workflow?

+2  A: 

The way you want to do it seems sound to me; otherwise you need to post more information about what you are trying to accomplish.

One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:

workflow = [ ('dataset', 'some dataset'),
             ('algorithm1', "parameters"),
             ('algorithm2', "parameters"),
             ('algorithm3', "parameters")]
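A loop over such a list might look like this (the `ALGORITHMS` dict and the toy algorithms are illustrative stand-ins, not part of the answer):

```python
# Hypothetical name-to-function registry; stand-ins for the real algorithms.
ALGORITHMS = {
    'algorithm1': lambda data, params: [x + 1 for x in data],
    'algorithm2': lambda data, params: [x * 2 for x in data],
}

def process(workflow):
    # The first tuple holds the dataset; each later tuple names an algorithm
    # and its parameters. Each step's output feeds the next step.
    _, data = workflow[0]
    for name, params in workflow[1:]:
        data = ALGORITHMS[name](data, params)
    return data

workflow = [('dataset', [1, 2, 3]),
            ('algorithm1', None),
            ('algorithm2', None)]
# process(workflow) -> [4, 6, 8]
```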
fabrizioM
+3  A: 

If each algorithm works on each element of the dataset, map() would be an elegant option:

dataset=workflow[0]
for algorithm in workflow[1:]:
    dataset=map(algorithm, dataset)

e.g. for the squares of odd numbers only, use:

>>> algo1=lambda x:0 if x%2==0 else x
>>> algo2=lambda x:x*x
>>> dataset=range(10)
>>> workflow=(dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset=map(algo, dataset)
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
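(Note the session above is Python 2, where map() returns a list. In Python 3 map() returns a lazy iterator, so you'd wrap the final result in list() to materialize it:)

```python
algo1 = lambda x: 0 if x % 2 == 0 else x   # zero out the even numbers
algo2 = lambda x: x * x                    # square everything
dataset = range(10)
for algo in (algo1, algo2):
    dataset = map(algo, dataset)           # lazy in Python 3
result = list(dataset)                     # force evaluation
# result == [0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
```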
Adam Matan
Maybe less concise and elegant than the highest-scored answer, but I would tend to prefer this for readability and ease of reuse by other users.
Morlock
+1  A: 

Define a Dataset class that tracks... data... for your set. Define methods in this class. Something like this:

class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on the algorithm
        pass

    def algorithm2(self, param1, param2):
        # More updating/processing
        pass

Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.

myDataset = Dataset() # Whatever actual construction you need to do

For each subsequent entry...

  • Extract the key/value somehow (I'd recommend changing your workflow data structure if possible, dict is inconvenient here)
  • Parse the param string to a tuple of arguments (this step is up to you).
  • Assuming you now have the string algorithm and the tuple params for the current iteration...

    getattr(myDataset, algorithm)(*params)

  • This will call the function on myDataset with the name specified by "algorithm" with the argument list contained in "params".
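Put together, the dispatch step might look like this (the Dataset class and its method are illustrative stand-ins):

```python
class Dataset:
    def __init__(self, data):
        self.data = data

    def algorithm1(self, factor):
        # Illustrative algorithm: scale every element in place.
        self.data = [x * factor for x in self.data]

myDataset = Dataset([1, 2, 3])
algorithm, params = 'algorithm1', (10,)
# getattr looks the method up by name, so the string from the
# workflow drives which method gets called.
getattr(myDataset, algorithm)(*params)   # same as myDataset.algorithm1(10)
# myDataset.data == [10, 20, 30]
```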

Sapph
+6  A: 

You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:

result = reduce(lambda data, (aname, p): algo_by_name(aname)(p, data), workflow)

This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):

workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]

And that your algorithms look like:

def algo0(p, data):
    …
    return output_data.filename

algo_by_name takes a name and gives you an algo function; for example:

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1, }[name]
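A runnable sketch of the whole pipeline (note that Python 3 dropped tuple parameters in lambdas, so the unpacking moves into the lambda body; the algorithms here are toy stand-ins):

```python
from functools import reduce

def algo0(p, data):
    # Toy stand-in: add a constant to every element.
    return [x + p.get('offset', 0) for x in data]

def algo1(p, data):
    # Toy stand-in: scale every element.
    return [x * p.get('factor', 1) for x in data]

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1}[name]

workflow = [[1, 2, 3],
            ('algo0', {'offset': 1}),
            ('algo1', {'factor': 2})]

# With no initializer, reduce uses workflow[0] (the dataset) as the
# starting accumulator and folds each (name, params) step into it.
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data),
                workflow)
# result == [4, 6, 8]
```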

(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)

Tobu
Excellent and elegant.
Adam Matan
Sorry for being such a newbie, but I couldn't make the code work. Should I return algorithm names from the algo_by_name function?
Stephen T.
algo_by_name(aname) needs to be a function, so you can pass (p, data) to it. I wrote an example.
Tobu
Thanks for the example, I appreciate it.
Stephen T.
+1  A: 

Here is how I would do this (all code untested):

Step 1: You need to create the algorithms. The Dataset could look like this:

class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x

Notice that you make an iterator out of it, so you iterate over it one item at a time. There's a reason for that, as you'll see later.

Another algorithm could look like this:

class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier

Step 2

Your user would then need to chain these together somehow. If he had access to Python directly, you can just do this:

dataset = Dataset(range(100))
multiplier = Multiplier(dataset, 5)

and then get the results by:

for x in multiplier:
    print x

And it would ask the multiplier for one piece of data at a time, and the multiplier would in turn ask the dataset. If you have a chain, this means that one piece of data is handled at a time. This means you can handle huge amounts of data without using a lot of memory.

Step 3

Probably you want to specify the steps in some other way, for example a text file or a string (it sounds like this may be web-based?). Then you need a registry of the algorithms. The easiest way is to just create a module called "registry.py" like this:

algorithms = {}

Easy, eh? You would register a new algorithm like so:

from registry import algorithms
algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier

You'd also need a method that creates the chain from specifications in a text file or something. I'll leave that up to you. ;)
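One possible sketch of such a builder, assuming the spec is a simple list of (name, args) pairs (that spec format is my assumption, not part of the answer):

```python
algorithms = {}  # the registry module's dict

class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x

class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier

algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier

def build_chain(spec):
    # The first step is constructed directly; each later step wraps
    # the previous one, so iteration stays lazy all the way down.
    name, args = spec[0]
    step = algorithms[name](*args)
    for name, args in spec[1:]:
        step = algorithms[name](step, *args)
    return step

chain = build_chain([('dataset', ([1, 2, 3],)),
                     ('multiplier', (5,))])
# list(chain) == [5, 10, 15]
```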

(I would probably use the Zope Component Architecture and make algorithms components and register them in the component registry. But that is all strictly speaking overkill).

Lennart Regebro