Hi all,

On UNIX, I have to produce numeric results from previous data by means of various command-line utilities. These utilities read the starting data (generally, but not exclusively, from CSV files), perform computations, and write the data out (again, generally, but not exclusively, to CSV files).

Of course, I don't want to run the risk of having outdated derived data, so I need to chain the data dependencies through the utilities. There is an obvious similarity with a spreadsheet: when you change a cell, all the related cells change as well, in cascading fashion. However, due to the more complex and automated nature of my task, I cannot use a spreadsheet.

Now, the first idea that comes to mind is to use make, which I already have experience with. It is simple and fits the task well. However, make is file-based, so if you have a dependency on data stored in a database, you must trick the system. I also know about biomake, but as far as I remember it is written in Prolog, and I don't want to venture down that path.

Before I go down the makefile path, I am interested in additional input from you. Do any of you have suggestions on how to do this kind of data handling, utilities better suited than make, or advice on how to organize the file layout (of the data and the makefiles)?

Thanks

+1  A: 

Some alternatives that spring to mind:

  • Ant has pretty nice support for customizing dependencies using Java.
  • SCons allows you to write custom dependency code using Python (a minimal sketch follows this list).
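
As one possible sketch of the SCons approach (do_step1 and the file names are stand-ins echoing the question, not real tools):

# SConstruct -- run with "scons"; the SCons API is injected automatically.
env = Environment()

# Rebuild step1.csv whenever source.csv changes, by invoking an
# external command-line utility; SCons tracks the dependency itself.
env.Command('step1.csv', 'source.csv', 'do_step1 -o $TARGET $SOURCE')
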
JesperE
+1  A: 

Two other alternatives are:

  • Jam, which the Boost project uses
  • QMake, used by the Qt project
oykuo
+1  A: 

Rake is a Ruby implementation of Dependency-Oriented Programming that is heavily inspired by Make and Ant, but much cleaner and nicer to use.

Recently, a newcomer called Tap has appeared on the scene. It also allows Dependency-Oriented Programming, but extends it with concepts such as Workflows. It was designed by a PhD biochemistry student who works in a biomolecular research lab, specifically to do exactly the things you mention: keeping scientific data derived from experiments up to date.

Jörg W Mittag
I don't really like Ruby, but I see some nice ideas here. Do you know if similar initiatives exist for Python?
Stefano Borini
+1  A: 

Assuming it is possible to discover when the database records were last changed, you should be able to write a program that sets the date of a sentinel file to the date of the newest record in the relevant source tables (or to "now", if that is simpler). Doing that for each database or query gives you a collection of sentinel files that can be used, along with your existing CSV source files, to feed the dependency tree and drive the whole calculation with standard make.

One easy way to get the sentinels updated on every build is to use a build script that runs the sentinel generator followed by make, in place of the bare make command itself.
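
A minimal sketch of such a wrapper in Python (touch_sentinel, the sentinel names, and the database files are hypothetical, matching the makefile below):

#!/usr/bin/env python
# build.py -- refresh the database sentinels, then hand off to make
# so the normal dependency checks take over.
import subprocess
import sys

# touch_sentinel is the assumed helper sketched at the end of this answer.
subprocess.check_call(['touch_sentinel', '-o', 'table_A.txt', 'rawdata.sqlite', 'A'])
subprocess.check_call(['touch_sentinel', '-o', 'table_B.txt', 'otherdata.sqlite', 'B'])

# Forward any command-line arguments (e.g. targets) to make and
# propagate its exit status.
sys.exit(subprocess.call(['make'] + sys.argv[1:]))
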

It should be possible to arrange for make to automatically update the sentinels as part of the normal dependency checks. Something like the following (untested) should do the trick:

# (Recipe lines must be indented with a tab.)
.PHONY: all clean

all: results.txt

clean:
        -rm table_*.txt
        -rm step*.csv
        -rm results.txt

results.txt: step2.csv
        write_report -o results.txt step2.csv

step1.csv: source.csv table_A.txt
        do_step1 -o step1.csv source.csv

step2.csv: step1.csv table_B.txt
        do_step2 -o step2.csv step1.csv

# The sentinels depend on FORCE so that touch_sentinel runs on every
# build; it only moves the timestamp when the table actually changed,
# so downstream targets are not rebuilt needlessly.
table_A.txt: FORCE
        touch_sentinel -o table_A.txt rawdata.sqlite A

table_B.txt: FORCE
        touch_sentinel -o table_B.txt otherdata.sqlite B

FORCE:

where touch_sentinel creates an output file whose timestamp matches the latest update to a table in a database. Determining how to learn that date is left as an exercise for the reader, though one possible sketch follows...
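
For example, assuming an SQLite database whose tables carry an 'updated' column storing seconds since the epoch (both assumptions; adapt the query to your actual schema):

#!/usr/bin/env python
# touch_sentinel -- a sketch, not a real utility.
import argparse
import os
import sqlite3

parser = argparse.ArgumentParser()
parser.add_argument('-o', dest='output', required=True)
parser.add_argument('database')
parser.add_argument('table')
args = parser.parse_args()

conn = sqlite3.connect(args.database)
# Table names cannot be bound as SQL parameters, so the name is
# interpolated; it comes from the makefile, not from untrusted input.
row = conn.execute('SELECT MAX(updated) FROM %s' % args.table).fetchone()
conn.close()
latest = row[0] or 0

# Create the sentinel if it does not exist, then set its mtime to the
# newest record's timestamp so make can compare it against the targets.
open(args.output, 'a').close()
os.utime(args.output, (latest, latest))
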

RBerteig
_very_ interesting. Thanks!
Stefano Borini