views:

1063

answers:

9

I'm looking for a proper implementation of a work stealing queue in C/CPP. I've looked around Google but haven't found anything useful.

Perhaps someone is familiar with a good open-source implementation? (I prefer not to implement the pseudo-code taken from the original academic papers).

+7  A: 

Take a look at Intel's Threading Building Blocks.

http://www.threadingbuildingblocks.org/

Kevin
TBB is way more massive and complex for my needs. I'm looking for a much more simple, "dedicated" implementation ... if there is any
A: 

I don't think JobSwarm uses work stealing, but it's a first step. I'm not aware of other open source libraries for this purpose.

Francis Boivin
A: 

Will breaking your work tasks into smaller units eliminate the need for stealing work in the first place?

Jay
no, since statically distributing the work to multiple threads is just not efficient enough (each work item could take a different amount of time to complete). I'm looking to improve an already existing load-balancing algorithm, so a work stealing queue seems like an interesting option
If you have an existing load balancing algorithm and want to improve on it, why pre-suppose the solution involves work stealing? Why not present the situation as it is, and ask for better solutions. Work stealing might be one of them, but it almost certainly isn't the only one.
Novelocrat
@Novelocrat, a work-stealing queue is only one of the options I'm looking into
+1  A: 

The structured_task_group class of the PPL uses a work stealing queue for its implementation. If you need a WSQ for threading, I would recommend that.
If you are actually looking for the source, I don't know if the code is given in ppl.h or if there is a precompiled object; I will have to check when I get home tonight.

BlueRaja - Danny Pflughoeft
A: 

dunno if this would be of any help to you, but have a look at this article on AMD's developer network, its simple, but should give you something useful

Necrolis
+4  A: 

No free lunch.

Please take a look the original work stealing paper. This paper is hard to understand. I know that paper contains theoretical proof rather than pseudo code. However, there is simply no such much more simple version than TBB. If any, it won't give optimal performance. Work stealing itself incurs some amount of overhead, so optimizations and tricks are quite important. Especially, dequeues are must be thread-safe. Implementing highly scalable and low-overhead synchronizations are challenging.

I'm really wondering why you need it. I think that proper implementation means something like TBB and Cilk. Again, work stealing is hard to implement.

minjang
A: 

There exist a tool to simply doing it in an very elegant way. It is a really effective way to parrallelize your program in a very short time.

Cilk project

HPC Challenge Award

Our Cilk entry for the HPC Challenge Class 2 award won the 2006 award for ``Best Combination of Elegance and Performance''. The award was made at SC'06 in Tampa on November 14 2006.

Phong
+4  A: 

To implement "work stealing" isn't hard in theory. You need a set of queues containing tasks that do work by doing a combination of computing and generating other tasks to do more work. And you need atomic access to the queues to place newly generated tasks into those queues. Finally, you need a procedure that each task calls at the end, to find more work for the thread that executed the task; that procedure needs to look in work queues to find work.

Most such work-stealing systems make the assumption that there are a small number of threads (backed up typically by real processor cores), and that there is a exactly one work queue per thread. Then you first try to steal work from your own queue, and if it is empty, try to steal from others. What gets tricky is knowing which queues to look in; scanning them serially for work is pretty expensive and can create a huge amount of contention between threads looking for work.

So far this is all pretty generic stuff with one two major exceptions: 1) switching contexts (e.g, setting processor context registers such as a "stack") cannot be stated in pure C or C++. You can resolve this by agreeing to write part of your package in target-platform specific machine code. 2) Atomic access to the queues for a multiprocessor cannot be done purely in C or C++ (ignoring Dekker's algorithm), and so you'll need to code those using assembly language synchronization primitives like the X86 LOCK XCH or Compare and Swap. Now, the code involved in updating the queuse once you have safe access isn't very complex, and you could easily write that in a few lines of C.

However, I think you will find is that attempting to code such a package in C and C++ with mixed assembler is still rather inefficient and you'll eventually end up coding the entire thing in assembler anyway. All's that left are C/C++ compatible entry points :-}

I did this for our PARLANSE parallel programming language, which offers the idea of an arbitrarily large number of parallel computations live and interacting (synchonizing) at any instant. It is implemented behind the scenes on an X86 exactly with one thread per CPU, and the implementation is entirely in assembler. The work-stealing code is probably 1000 lines total, and its tricky code because you want it to be extremely fast in the non-contention case.

The real fly in the ointment for C and C++ is, when you create a task representing work, how much stack space do you assign? Serial C/C++ programs avoid this question by simply overallocating huge amounts (e.g, 10Mb) of one linear stack, and nobody cares much about how much of that stack space is wasted. But if you can create thousands of tasks and have them all live at a particular instant, you can't reasonably allocate 10Mb to each one. So now you either need to determine statically how much stack space a task will need (Turing-hard), or you'll need to allocate stack chunks (e.g., per function call), which widely available C/C++ compilers don't do (e.g, the one you are likely using). THe last way out is to throttle task creation to limit it to a few hundred at any instant, and multiplex a few-hundred really huge stacks among the tasks that are live. You can't do the last if the tasks can interlock/suspend state, because you'll run into your threshold. So you can only do this if the tasks only do computation. That seems like a pretty severe constraint.

For PARLANSE, we built a compiler that allocates activation records on the heap for each function call.

Ira Baxter
Or you do the sane thing, and don't allocate space to tasks until they're actually running, and don't think of tasks as things to suspend and resume, but rather to run from execution to completion.
Novelocrat
Your solution isn't sane. If you build complex systems, when one piece of work can call arbitrary other pieces of work, you can't guarantee that your task won't need suspension. You can certainly force this property to be true; you'll then have a hard time building complex systems. We build million-line paralell programs in PARLANSE.
Ira Baxter
+2  A: 

If you're looking for a standalone workstealing queue class implementation in C++ built on pthread or boost::thread, good luck, to my knowledge there isn't one.

However, as others have said Cilk, TBB and Microsoft's PPL all have workstealing implementations under the hood.

The question is do you want to use a workstealing queue or implement one? If you just want to use one then the choices above are good starting points simply scheduling a 'task' in any of them will suffice.

As BlueRaja said the task_group & structured_task_group in PPL will do this, also note that these classes are available in the latest version of Intel's TBB as well. The parallel loops (parallel_for, parallel_for_each) are also implemented with workstealing.

If you must look at source rather than use an implementation, TBB is OpenSource and Microsoft ships the sources for its CRT, so you can go spelunking.

You can also look on Joe Duffy's blog for a C# implementation (but it's C# and the memory model is different).

-Rick

Rick