Disclaimer: I have been toying around with D 2.0 for maybe 10 days (tick "not regularly using D"). I consider this an opportunity to learn something about D.
Regarding 1 and 2: Easy to read and understand. (In the example, writeln("Sum = ", myFuture.spinWait()); should probably be writeln("Sum = ", myTask.spinWait());)
Regarding 3: a parallel prefix would be nice. And I don't know enough about D, but I guess mutexes are defined somewhere else.
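To make the parallel prefix request concrete, here is a minimal sketch of a block-wise inclusive prefix sum. It assumes a parallel foreach helper like `parallel` from Phobos's std.parallelism; the names in the library under review may differ. `parallelPrefixSum` and `nBlocks` are hypothetical names of mine, and this is one common three-pass scheme, not necessarily how the library would implement it.

```d
import std.algorithm.comparison : min;
import std.parallelism : parallel;
import std.range : iota;

/// Inclusive prefix sum computed block-wise in three passes.
/// Hypothetical helper for illustration; not part of any library.
long[] parallelPrefixSum(const(long)[] input, size_t nBlocks = 4)
{
    auto result = new long[input.length];
    if (input.length == 0)
        return result;
    if (nBlocks == 0)
        nBlocks = 1;

    // Ceiling division, then the number of blocks actually needed.
    size_t blockSize = (input.length + nBlocks - 1) / nBlocks;
    size_t usedBlocks = (input.length + blockSize - 1) / blockSize;
    auto blockTotals = new long[usedBlocks];

    // Pass 1 (parallel): scan each block on its own and record its total.
    foreach (b; parallel(iota(usedBlocks), 1))
    {
        immutable lo = b * blockSize;
        immutable hi = min(lo + blockSize, input.length);
        long running = 0;
        foreach (i; lo .. hi)
        {
            running += input[i];
            result[i] = running;
        }
        blockTotals[b] = running;
    }

    // Pass 2 (serial, cheap): turn block totals into per-block offsets.
    auto offsets = new long[usedBlocks];
    foreach (b; 1 .. usedBlocks)
        offsets[b] = offsets[b - 1] + blockTotals[b - 1];

    // Pass 3 (parallel): shift each block by the sum of all earlier blocks.
    foreach (b; parallel(iota(usedBlocks), 1))
    {
        immutable lo = b * blockSize;
        immutable hi = min(lo + blockSize, input.length);
        foreach (i; lo .. hi)
            result[i] += offsets[b];
    }
    return result;
}

void main()
{
    auto sums = parallelPrefixSum([1L, 2, 3, 4, 5, 6, 7]);
    assert(sums == [1L, 3, 6, 10, 15, 21, 28]);
}
```

Each block writes only to its own slice of the result, so no mutex is needed here; only the short middle pass is serial.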
Regarding 4: your design seems to indicate that you have a worker pool, start up a couple of threads, and the threads then steal tasks from this pool. Now, I have debugged my share of bottlenecks (mostly of my own making). Besides NUMA and "judiciously" serializing things with the help of mutexes, pools can also be very "successful" at serializing your program and introducing overhead. I understand that the API does not prevent a good implementation. It just makes me wonder: why are map, reduce, parallel_for not functions? Does D offer advantages if these are methods?
Edit: I have played around with your library, and it is nice. It also scales well (relative to hand-coded threading) for cases with mostly calculations and low memory usage. I will just reiterate the two suggestions I have already made:
I would consider separating algorithms (data parallelism) and task groups (task parallelism). This would bring it closer to more common C++ libraries (TBB, OpenMP, MS PPL and TPL). Also from an implementation perspective: you might want to schedule data parallelism without task groups in the future (e.g. GPU bound) or use additional information (e.g. memory layout).
This already implies that the scheduler could be made independent from the TaskPool. Furthermore, I would also consider making the scheduler a singleton. To quote Intel's TBB FAQ on why the scheduler is a singleton:
[...] some libraries control program-wide
resources, such as memory and
processors. For example, garbage
collectors control memory allocation
across a program. Analogously, TBB
controls scheduling of tasks across a
program. To do their job effectively,
each of these must be a singleton; [...]
Allowing k instances of the TBB
scheduler in a single program would
cause there to be k times as many
software threads as hardware threads.
The program would operate
inefficiently, because the machine
would be oversubscribed by a factor of
k, causing more context switching,
cache contention, and memory
consumption.
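The arithmetic in the quote can be sketched in D. This assumes a pool type like Phobos's `TaskPool`, whose constructor takes a worker count and which exposes `size` and `finish`; the value of `k` is a hypothetical choice for illustration. The sketch only counts worker threads, it does not measure the resulting slowdown.

```d
import std.parallelism : TaskPool, totalCPUs;
import std.stdio : writefln;

void main()
{
    enum k = 3; // hypothetical number of independent schedulers in one program
    TaskPool[] pools;
    size_t softwareThreads = 0;

    foreach (_; 0 .. k)
    {
        // Each pool sizes itself for the whole machine, unaware of the others.
        auto pool = new TaskPool(totalCPUs);
        softwareThreads += pool.size;
        pools ~= pool;
    }

    // k pools of totalCPUs workers each: oversubscribed by a factor of k.
    assert(softwareThreads == k * totalCPUs);
    writefln("%s hardware threads, %s worker threads", totalCPUs, softwareThreads);

    foreach (pool; pools)
        pool.finish(); // let the workers exit so the program can terminate
}
```

With a single shared scheduler, the worker count would stay near the hardware thread count no matter how many components submit tasks.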