The company I work for has lots of "complicated" file-based batch processes, with sequences of steps such as:

  • take file A
  • fetch file B
  • join fields in file A to file B to make file C
  • run some heuristics on file C to make file D
  • upload file D to server X
  • build a report based on files D and A and mail it to [email protected]

Each step may take many hours to run (files may contain billions of lines of data). The whole thing is glued together with GNU Makefiles, with sections such as:

  fileD: fileC
          run-analysis $^ > $@

The Makefiles are useful for modelling the dependencies between steps, as well as allowing everything after a certain step to be repeated (if there's a problem with a step, or the heuristics are changed and so on).
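For illustration, the whole pipeline described in the question could be modelled along these lines (the file names, commands, and the upload sentinel are hypothetical placeholders, not the actual setup):

```makefile
# Hypothetical sketch of the pipeline; command names are invented.
all: report

fileB:
	fetch-file-B > $@                  # fetch file B

fileC: fileA fileB
	join-fields fileA fileB > $@       # join fields in A to B

fileD: fileC
	run-heuristics $< > $@             # heuristics step

upload.done: fileD
	upload-to-server-x fileD && touch $@

report: fileD fileA upload.done
	build-report fileD fileA | mail [email protected]
	touch $@
```

Touching a sentinel file such as upload.done is one common trick for letting Make track side-effecting steps that produce no output file of their own.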

Using Makefiles always seems bad to me, as they're for building software, not running batch processes. Also, Makefiles don't provide any form of testing framework.

My question is, how do you coordinate large sequences of operations like these?

+5  A: 

Makefiles can be used for building software, but they are not limited to that activity.

Makefiles can help sequence many things, including test frameworks.

Have you used a Makefile-based build, test, install sequence? There are also tools that generate Makefiles!
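As a sketch of what a Makefile-driven test step might look like alongside the pipeline rules (the target name and the check scripts here are made up):

```makefile
# Hypothetical: run sanity checks on an intermediate file before continuing.
test: fileC
	./check-row-counts.sh fileC
	./check-join-keys.sh fileC

.PHONY: test
```

Declaring the target .PHONY makes Make run the checks every time, rather than treating "test" as a file to be built.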

There are also some out-of-the-way uses for Makefiles, and other references within Stack Overflow on the topic.

nik
I'll second this. Take a look at The Make Book and you'll realise how many uses Makefiles have. http://oreilly.com/catalog/9780596006105/book/index.csp
scvalex
A: 

Has the data in the files outgrown the file structure? Perhaps it is time to start thinking about new data sources if the data in the files is indeed well-structured.

I am sensing that replacing files A and B with well structured data in a database is not an option, though. How about this:

  1. Load the structured data from file A and fetched file B into a series of relational database tables.
  2. Perform the joins from the tables to create data in another table(s) (or even in memory).
  3. Run the needed heuristics.
  4. Create an output file D from the resulting data.
  5. Build a report from the resulting and initial data.

Steps 1 and 4 would still be slow, but I am betting you could speed up the entire process by using more efficient data structures for the actual processing.

The joy of working with databases is that many more programming options are available to you (pick a language you like) when it comes to writing the joining/processing routines. You do not need to rely on Makefiles exclusively.
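One way to sketch steps 1 through 4, assuming files A and B happen to be CSV and using the sqlite3 command-line tool (the table names, column names, and filter are invented for illustration):

```makefile
# Hypothetical: load, join and process inside SQLite instead of flat files.
joined.db: fileA.csv fileB.csv
	rm -f $@
	printf '.mode csv\n.import fileA.csv a\n.import fileB.csv b\n' | sqlite3 $@
	sqlite3 $@ "CREATE TABLE c AS SELECT a.*, b.extra FROM a JOIN b ON a.id = b.id;"

fileD: joined.db
	sqlite3 -csv joined.db "SELECT * FROM c WHERE score > 0.5;" > $@
```

This also keeps the dependency tracking in Make while moving the heavy joining work into the database, so the two approaches are not mutually exclusive.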

tehblanx
+3  A: 

Makefiles are actually quite good for this sort of thing and are quite widely used. They can be used for anything that involves dependency tracking (I've heard an anecdote about an expert system implemented as a makefile). GNU Make can also execute multiple jobs in parallel (e.g. make -j8).

You shouldn't get too worked up about makefiles as the alternatives are enterprise scheduling tools such as Control-M. These tools are:

  • Much, much more complicated

  • Very expensive

  • Fairly opaque and somewhat harder to test than a makefile

  • Politically difficult to get set up on your local machine so you can test them.

Stick with the makefiles unless you have a very good reason not to. Enterprise system management tools can be a win if you have really big installations with hundreds or thousands of heterogeneous systems. Unless you are operating on that scale there are very good reasons not to use tooling of that sort.

The principal argument against high-end 'enterprise' systems is that rolling out this type of infrastructure tends to empower an inner sanctum of hierophants to camp on the sacred knowledge of how to run these 'enterprise' systems. This process is known as 'empire building' in management literature. When challenged, the empire builder can easily blind management with science by implying that they have special knowledge and that no one else is qualified to do anything with their 'enterprise' systems, which are far too sophisticated for mere mortals to comprehend.

'Enterprise Architecture' bureaucracy can be quite hard to argue with unless you are familiar with the tooling. Makefiles are familiar, everyman tools. You can argue about makefiles on an equal footing.

Stick with the makefiles. It keeps the bastards honest.

ConcernedOfTunbridgeWells