What challenges in shared memory parallel programming (particularly multicore) cannot be solved or cannot be solved efficiently using a Cilk-style solution (i.e. nested data parallelism with per-core work stealing task deques)?

+1  A: 

I think Cilk's model is nested task parallelism (which you can use to implement data parallelism). It is pretty cute, but...

Cilk doesn't support SIMD data parallelism or streaming parallelism.

Cilk doesn't appear to handle partial orders over tasks well, because it offers only nested parallelism. Try coding the following set of parallel tasks: A, B, C, D, with the ordering constraints A before B, A before D, and C before D. (This is the canonical example of the smallest partial task order that nested task parallelism can't encode directly.) You lose some of the parallelism when implementing this with nested parallelism, and parallelism being precious, you don't want to waste opportunities to be parallel.
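To make the constraint concrete, here is a minimal sketch (Python threads and events standing in for grains and futures; not Cilk or PARLANSE code) that encodes the diamond partial order directly, so B never waits on C. A nested fork-join version would need a sync barrier after {A, C}, forcing B to wait for C unnecessarily:

```python
import threading

# One completion event per task; waiting on these encodes the partial order.
done = {name: threading.Event() for name in "ABCD"}
order = []
lock = threading.Lock()

def task(name, deps):
    for d in deps:
        done[d].wait()          # wait only on true predecessors
    with lock:
        order.append(name)
    done[name].set()

# Diamond partial order: A before B, A before D, C before D.
threads = [threading.Thread(target=task, args=(n, d))
           for n, d in [("A", ""), ("B", "A"), ("C", ""), ("D", "AC")]]
for t in threads: t.start()
for t in threads: t.join()

print(order)  # always respects A<B, A<D, C<D
```

With `cilk_spawn`/`cilk_sync`, by contrast, the only way to satisfy "C before D" is a sync that also makes B wait for C, serializing work that the partial order allows to overlap.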

It doesn't (AFAIK) support exception handling across thread boundaries. This is needed if you want to build a really big, complex symbolic application. It is also useful if you want to compute speculatively.

I also don't think Cilk can handle large sets of computation grains interacting (i.e., waiting on synchronization events), because AFAIK a Cilk program can only have as many live parallel computations as there are threads. This is due to the Cilk implementation choice of living on top of standard C/C++ compilers and their stack models, which on most workstations use one big stack per thread. You might get to 10 or 100 threads, but not 10,000; this matters when handling very large graphs. [I don't know for a fact that Cilk even allows computation grains to synchronize, but I don't see any technical reason why it could not if it gave up the big-stack model.]
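Here is a toy Python sketch (illustrative only, not PARLANSE) of why heap-allocated grain state matters: generator frames live on the heap, so tens of thousands of "grains" can sit suspended waiting on an event, where one OS thread (with its megabytes of fixed stack) per grain would not scale:

```python
# Each "grain" is a generator: its activation record lives on the heap,
# so suspending it costs a small object, not an OS thread stack.
def grain(i, results):
    yield            # suspend here until the scheduler resumes us
    results.append(i)

results = []
grains = [grain(i, results) for i in range(10_000)]

for g in grains:
    next(g)          # start all 10,000 grains; every one is now suspended

for g in grains:
    next(g, None)    # the "event" fires: resume each grain to completion

print(len(results))  # 10000 grains ran, on a single OS thread
```

With one big stack per thread, 10,000 simultaneously suspended computations would need 10,000 thread stacks; here they need only 10,000 small heap frames.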

A second implication is that Cilk applications can't recurse over huge data structures, because whatever stack size you choose is bounded, and there is some input for which you will run out of stack. (This isn't a design flaw of Cilk's model, just of its implementation.) That's too bad, because huge data structures are exactly where you want parallelism.
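The stack-depth limit is easy to demonstrate in any one-big-stack setting. A Python sketch (illustrative, not Cilk): recursive traversal of a deep structure overflows the fixed-size stack, while keeping the "activation records" on the heap does not:

```python
# A degenerate "tree": a chain 100,000 nodes deep.
class Node:
    __slots__ = ("child",)
    def __init__(self, child):
        self.child = child

root = None
for _ in range(100_000):
    root = Node(root)

def depth_recursive(n):
    # Each call consumes a stack frame on the fixed-size stack.
    return 0 if n is None else 1 + depth_recursive(n.child)

def depth_heap_stack(n):
    # The pending work lives on the heap, bounded only by memory.
    depth, stack = 0, [n]
    while stack:
        n = stack.pop()
        if n is not None:
            depth += 1
            stack.append(n.child)
    return depth

try:
    depth_recursive(root)
except RecursionError:
    print("recursive traversal blew the fixed-size stack")
print(depth_heap_stack(root))  # 100000
```

The heap-allocated version is exactly the activation-record trick described above: bounded by memory, not by whatever stack size was picked at thread creation.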

For an alternative, see PARLANSE, which offers an arbitrarily large number of computation grains with work stealing, but with heap-allocated grains and activation records. Each grain has its own context, and therefore one can implement large interacting sets of grains, because it is straightforward to save a grain's state when it needs to wait on an event. PARLANSE synchronization primitives include futures, semaphores, and critical function results (see below).

PARLANSE offers explicit "teams" (sets of computation grains) as an abstraction, with exceptions propagating out of functions to the top of a computation grain (Java defines this to be "undefined"), then to the team parent, and back to all the other team children as an asynchronous abort-exception (catchable in a try clause), allowing the other children to clean up.

Because some actions (e.g., asynchronous abort-exceptions) can occur at arbitrary times, PARLANSE offers the notion of critical functions, whose results are guaranteed to be returned to the caller atomically: a function either assuredly returns a result or it does not, and the function can clean up resources safely in an async abort handler.
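The guarantee can be sketched like this (hypothetical names, Python rather than the PARLANSE API): the function either publishes its result or does not, and releases its resources on both paths, even when an abort lands mid-function:

```python
class AbortException(Exception):
    """Stands in for an asynchronous abort-exception."""

published = []        # results committed to the caller
open_resources = []   # resources the function must not leak

def critical(value, abort_arrives):
    open_resources.append(value)          # acquire a resource
    try:
        if abort_arrives:
            raise AbortException()        # an abort lands mid-function
        published.append(value)           # the result is committed
    finally:
        open_resources.remove(value)      # cleanup runs on both paths

critical(1, abort_arrives=False)          # commits its result
try:
    critical(2, abort_arrives=True)       # aborted: nothing published
except AbortException:
    pass
print(published, open_resources)          # [1] []
```

All-or-nothing: the caller never observes a half-committed result, and the aborted call leaks nothing.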

Special "partial order" teams allow one to encode computations in which the partial order is known; I think this is more efficient than Cilk's nested parallelism if you have large sets of such constraints.

(We use PARLANSE to implement large-scale program analysis/transformation tools. PARLANSE was invented to support this; we want parallelism because the artifacts we deal with are huge trees and graphs representing millions of lines of code.)

(PARLANSE doesn't do streams or SIMD either, but they aren't out of scope for the language. One could arguably add streams and SIMD to C and C++, but it's likely to be pretty hard.)

Ira Baxter