I have a computation task that is split into several individual program executions, with dependencies between them. I'm using Condor 7 as the task scheduler (with the Vanilla Universe, due to constraints on the programs that are beyond my control, so no checkpointing is involved), so a DAG looks like a natural solution. However, some of the programs need to run on the same host. I could not find a reference on how to do this in the Condor manuals.

Example DAG file:

JOB  A  A.condor 
JOB  B  B.condor 
JOB  C  C.condor    
JOB  D  D.condor
PARENT A CHILD B C
PARENT B C CHILD D

I need to express that B and D need to be run on the same computer node, without breaking the parallel execution of B and C.

Thanks for your help.

+1  A: 

I don't know the answer but you should ask this question on the Condor Users mailing list. The folks who support the DAG functionality in Condor monitor it and will respond. See this page for subscription information. It's fairly low traffic.

It's generally fairly difficult to keep two jobs together on the same host in Condor without locking them to a specific host in advance, DAG or no DAG. I actually can't think of a really viable way to do this that would let B start before C or C start before B. If you were willing to enforce that B must always start before C, part of the work that Job B does when it starts running could be to modify the Requirements portion of Job C's ClassAd so that it contains a "Machine == "<hostname>"" clause, where <hostname> is the name of the machine B landed on. This would also require that Job C be submitted on hold, or not submitted at all, until B was running; B would also have to release it as part of its startup work.
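To make that first idea a bit more concrete, here is a rough Python sketch of what B's startup step might build. It assumes that C's job id is somehow passed to B (the `job_id` argument is hypothetical), that the `condor_qedit` and `condor_release` tools can be run from the execute node and can reach the submitting schedd, and that the machine's fully qualified name matches its Machine attribute. None of that is guaranteed on a given pool. Note also that this overwrites C's whole Requirements expression, including anything condor_submit added to it.

```python
import socket


def pin_commands(job_id, hostname=None):
    """Build the command lines that pin a held job to one host and release it.

    job_id is a hypothetical "cluster.proc" string for the held job.
    hostname defaults to this machine's fully qualified name (an assumption:
    it is expected to match the Machine attribute Condor advertises).
    """
    host = hostname or socket.getfqdn()
    # ClassAd expression that only matches the chosen machine.
    requirements = 'Machine == "%s"' % host
    return [
        ["condor_qedit", job_id, "Requirements", requirements],
        ["condor_release", job_id],
    ]
```

B would then run each returned command in order, for example with `subprocess.check_call(cmd)`, before starting its real work.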

That's pretty complicated...

So I just had a thought: you could use Condor's dynamic startd/slots features and collapse your DAG to achieve what you want. In your DAG where you currently have two separate nodes, B and C, you would collapse this down into one node B' that would run both B and C in parallel when it starts on a machine. As part of the job requirements you note that it needs 2 CPUs on a machine. Switch your startd's to use the dynamic slot configuration so machines advertise all of their resources and not just statically allocated slots. Now you have B and C running concurrently on one machine always. There are some starvation issues with dynamic slots when you have a few multi-CPU jobs in a queue with lots of single-CPU jobs, but it's at least a more readily solved problem.
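The collapsed node B' could be a small wrapper along these lines. This is only a sketch: `run_together` is a name made up for illustration, and the command lines for B and C are whatever your real programs are. It starts both programs at once, waits for both, and reports failure if either one fails, so DAGMan sees a single exit code for the node.

```python
import subprocess


def run_together(commands):
    """Start every command at the same time and wait for all of them.

    commands is a list of argument lists, e.g. [["./B"], ["./C"]]
    (hypothetical program names). Returns 0 only if every command
    exited with status 0, otherwise 1.
    """
    # Launch everything first so the programs really run concurrently.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Then wait for each one and collect its exit status.
    codes = [proc.wait() for proc in procs]
    return 0 if all(code == 0 for code in codes) else 1
```

The wrapper script would end with something like `sys.exit(run_together([["./B"], ["./C"]]))`.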

Another option is to tag B' with a special job attribute:

MultiCPUJob = True

And target it just at slot 1 on machines:

Requirements = SlotID == 1 && ...your other requirements...

And have a static slot startd policy that says, "If a job with MultiCPUJob=True tries to run on slot 1 on me preempt any job that happens to be in slot 2 on this machine because I know this job will need 2 cores/CPUs".

This is inefficient but can be done with any version of Condor past 6.8.x. I actually use this type of setup in my own statically partitioned farms so if a job needs a machine all to itself for benchmarking it can happen without reconfiguring machines.

If you're interested in knowing more about that preemption option let me know and I can point you to some further configuration reading in the condor-user list archives.

Ian C.
thanks for the tip.
gurney alex
mail posted, will sum up here if I get meaningful answers
gurney alex
No answers on the list :-( Maybe it's just not feasible. I'll work around that issue.
gurney alex
Yea, it sounds hard enough to be impossible to me.
Ian C.
+2  A: 

Condor doesn't have any simple solutions, but there is at least one kludge that should work:

Have B leave some state behind on the execute node, probably in the form of a file, that says something like MyJobRanHere="UniqueIdentifier". Use the STARTD_CRON support to detect this and advertise it in the machine ClassAd. Have D use Requirements=MyJobRanHere=="UniqueIdentifier". As part of D's final cleanup, or perhaps in a new node E, remove the state. If you're running large numbers of jobs through, you'll probably need to clean out left-over state occasionally.
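As a rough illustration of the detection step, the script run by STARTD_CRON might look something like the following. The marker file location and the function name are assumptions made up for this sketch, and it assumes the identifier contains no double quotes. How the script is registered with STARTD_CRON is left to your Condor configuration.

```python
import os


def machine_ad_lines(marker_path, attribute="MyJobRanHere"):
    """Return the attribute lines to advertise for this machine.

    marker_path is a hypothetical file that job B writes, containing
    only the unique identifier. If the file is missing or empty,
    nothing is advertised.
    """
    if not os.path.isfile(marker_path):
        return []
    with open(marker_path) as handle:
        identifier = handle.read().strip()
    if not identifier:
        return []
    # One 'Name = "value"' line per attribute, written to stdout by the caller.
    return ['%s = "%s"' % (attribute, identifier)]
```

The script itself would just print each returned line to standard output.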

Alan De Smet
Nice trick. In the end I went for a similar but slightly simpler approach: jobs B and D both run in the same script, but D waits for C to create a file in a known place on a shared drive.
gurney alex
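For anyone landing here later, the workaround in the last comment could be sketched roughly as follows. This is not the actual script used; the function names, the polling interval and the flag file path are all assumptions for illustration. The single Condor job runs B, polls the shared drive until C's flag file shows up, then runs D on the same host.

```python
import os
import subprocess
import time


def wait_for_file(path, poll_seconds=5.0, timeout_seconds=None):
    """Poll until path exists. Return True if it appeared, False on timeout."""
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def run_b_then_d(b_cmd, flag_path, d_cmd, timeout_seconds=None):
    """Run B, wait for C's flag file on the shared drive, then run D."""
    subprocess.check_call(b_cmd)
    if not wait_for_file(flag_path, timeout_seconds=timeout_seconds):
        raise RuntimeError("gave up waiting for %s" % flag_path)
    subprocess.check_call(d_cmd)
```

With this layout the DAG only needs nodes A, the combined B+D job, and C; C has to create the flag file as its last step.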