I am impressed with Intel Thread Building Blocks. I like how I write tasks rather than thread code, and I like how it works under the hood, with my limited understanding (tasks are in a pool, there won't be 100 threads on 4 cores, a task is not guaranteed to run because it isn't on its own thread and may be far back in the pool, but it may be run together with another related task, so you can't do the typical thread-unsafe things).

I wanted to know more about writing tasks. I like the 'Task-based Multithreading - How to Program for 100 cores' video here http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently the second-to-last link; warning: it isn't 'great'). My favourite part was 'solving the maze is better done in parallel', around the 48-minute mark (you can click the link on the left side; that part is really all you need to watch, if any).

However, I would like to see more code examples and some API documentation on how to write tasks. Does anyone have a good resource? I have no idea how a class or pieces of code might look after being pushed onto a pool, how weird the code might look when you need to make a copy of everything, or how much of everything gets pushed onto a pool.

+4  A: 

Java has a parallel task framework similar to Thread Building Blocks - it's called the Fork-Join framework. It's available for use with the current Java SE 6 and is slated for inclusion in the upcoming Java SE 7.

There are resources available for getting started with the framework, in addition to the javadoc class documentation. The jsr166 page mentions that "There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."

The fork-join examples, such as matrix multiplication, are a good place to start.

I used the fork-join framework in solving some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead - mine was the only Java entry for the Knight's Tour problem, and it outperformed other entries in the competition. The Java sources and writeup are available from the challenge site for download.

EDIT:

I have no idea how a class or pieces of code may look after pushing it onto a pool [...]

You can make your own task by subclassing one of the ForkJoinTask subclasses, such as RecursiveTask. Here's how to compute the Fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs - comments are mine.)

 // declare a new task that itself spawns subtasks.
 // The task returns an Integer result.
 // (import java.util.concurrent.RecursiveTask in Java SE 7;
 //  in the Java SE 6 preview jar the class lives in the jsr166y package.)
 class Fibonacci extends RecursiveTask<Integer> {
   final int n;      // the n'th number in the fibonacci sequence to compute
   Fibonacci(int n) { this.n = n; } // constructor
   protected Integer compute() {   // this method is the main work of the task
     if (n <= 1)         // 1 or 0, base case to end recursion
        return n;
     Fibonacci f1 = new Fibonacci(n - 1);  // create a new task to compute n-1
     f1.fork();                            // schedule to run asynchronously
     Fibonacci f2 = new Fibonacci(n - 2);  // create a new task to compute n-2
     return f2.invoke() + f1.join();       // wait for both tasks to compute.
       // f2 is run as part of this task, f1 runs asynchronously.
       // (You could create two separate tasks and wait for them both, but running
       // f2 as part of this task is a little more efficient.)
   }
 }

You then run this task and get the result:

// default parallelism is the number of cores
ForkJoinPool pool = new ForkJoinPool();
Fibonacci f = new Fibonacci(30); // keep n modest: this naive version does
                                 // exponential work, and int overflows past fib(46)
int result = pool.invoke(f);

This is a trivial example to keep things simple. In practice, performance would not be so good, since the work executed by each task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that you end up with one core running one large task at the end of the problem. Splitting large tasks into smaller ones ensures that one core isn't left doing lots of work while the other cores sit idle; smaller tasks keep more cores busy, but tasks shouldn't be so small that they do no real work.
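
As a sketch of that rule of thumb, you can add a sequential cutoff below which a task stops forking and just computes directly. The THRESHOLD value here is a made-up tuning parameter you'd find by profiling:

 // Fibonacci with a sequential cutoff: below THRESHOLD the task computes
 // directly, so the framework overhead stays small relative to the work.
 // (needs the RecursiveTask import, as in the example above)
 class FibonacciWithCutoff extends RecursiveTask<Integer> {
   static final int THRESHOLD = 10; // hypothetical cutoff - tune by measuring
   final int n;
   FibonacciWithCutoff(int n) { this.n = n; }
   protected Integer compute() {
     if (n < THRESHOLD)
       return seqFib(n);   // small problem: no forking, compute in place
     FibonacciWithCutoff f1 = new FibonacciWithCutoff(n - 1);
     f1.fork();
     FibonacciWithCutoff f2 = new FibonacciWithCutoff(n - 2);
     return f2.invoke() + f1.join();
   }
   static int seqFib(int n) {  // plain single-threaded fibonacci
     return n <= 1 ? n : seqFib(n - 1) + seqFib(n - 2);
   }
 }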

[...] or how weird code may look when you need to make a copy of everything and how much of everything is pushed onto a pool.

Only the tasks themselves are pushed into a pool. Ideally you don't want to copy anything: to avoid interference and the need for locking, which would slow down your program, your tasks should work with independent data. Read-only data can be shared amongst all tasks and doesn't need to be copied. If threads need to co-operate in building some large data structure, it's best that they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add its piece of the puzzle to the overall solution. This often does require some form of locking, but that's not a significant performance issue as long as the work of the task is much greater than the work of updating the solution. My Knight's Tour solution takes this approach, updating a common repository of tours on the board.
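
Here is a minimal sketch of the build-the-pieces-separately approach (the class name, the range splitting, and the "work" are hypothetical placeholders): each subtask builds its own list, and the parent combines the pieces when joining, so nothing is shared while the subtasks run and no locking is needed:

 // Each subtask fills its own list over a half-open range [lo, hi);
 // the parent concatenates the pieces after joining. No shared mutable
 // state while the subtasks run, so no locks are required.
 // (needs imports: java.util.List, java.util.ArrayList, RecursiveTask)
 class CollectSquares extends RecursiveTask<List<Integer>> {
   final int lo, hi;
   CollectSquares(int lo, int hi) { this.lo = lo; this.hi = hi; }
   protected List<Integer> compute() {
     if (hi - lo <= 1000) {                  // small range: build directly
       List<Integer> piece = new ArrayList<Integer>();
       for (int i = lo; i < hi; i++)
         piece.add(i * i);                   // stand-in for real work
       return piece;
     }
     int mid = (lo + hi) / 2;
     CollectSquares left = new CollectSquares(lo, mid);
     left.fork();                            // left half runs asynchronously
     List<Integer> rightPiece = new CollectSquares(mid, hi).invoke();
     List<Integer> combined = left.join();   // combine pieces at the end,
     combined.addAll(rightPiece);            // preserving order
     return combined;
   }
 }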

Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several designs possible for solving a given problem, but only some of these are suitable for a threaded solution. It can take a few attempts to get a feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples, and then try it for yourself. Always profile, and measure the effects of varying the number of threads. You can explicitly set the number of threads (cores) the pool uses in the pool constructor. When the work divides up evenly, you can expect near-linear speedup as the number of threads increases.
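
For example, ForkJoinPool takes the parallelism level as a constructor argument, so a measurement loop is straightforward. A sketch, reusing the Fibonacci task from above:

 // time the same task with 1..8 worker threads
 for (int threads = 1; threads <= 8; threads++) {
   ForkJoinPool pool = new ForkJoinPool(threads); // explicit parallelism
   long start = System.nanoTime();
   pool.invoke(new Fibonacci(30));
   System.out.printf("%d threads: %.1f ms%n",
       threads, (System.nanoTime() - start) / 1e6);
   pool.shutdown();
 }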

mdma
A: 

Playing with "frameworks" which claim to solve the unsolvable (optimal task scheduling is NP-hard) is not going to help you at all - reading books, and then articles, on concurrent algorithms will. So-called "tasks" are nothing more than a fancy name for defining the separability of a problem (the parts that can be computed independently of each other). The class of separable problems is very small - and they are already covered in old books.

For problems which are not separable, you have to plan phases and data barriers between phases to exchange data. Optimal orchestration of data barriers for simultaneous data exchange is not just NP-hard but impossible to solve in a general way in principle - you'd need to examine the history of all possible interleavings, which is like the power set of an already exponential set (like going from N to R in math). The reason I mention this is to make it clear that no software can ever do it for you, that how to do it is intrinsically dependent on the actual algorithm, and that it makes or breaks whether parallelization is feasible at all (even if it's theoretically possible).

When you enter high parallelism you can't even maintain a queue; you don't even have a memory bus anymore - imagine 100 CPUs trying to sync up on just a single shared int, or trying to do memory bus arbitration. You have to pre-plan and pre-configure everything that's going to run, and essentially prove correctness on a whiteboard. Intel's Threading Building Blocks are a small kid in that world. They are for a small number of cores, which can still share a memory bus. Running separable problems is a no-brainer which you can do without any "framework".

So you are back to having to read about as many different parallel algorithms as you can. It normally takes 1-3 years to research an approximately optimal data barrier layout for one problem. It becomes a layout problem when you go to, say, 16+ cores on a single chip, since only first neighbors can exchange data efficiently (during one data barrier cycle). So you'll actually learn much more by looking at CUDA, and at papers and results from IBM's experimental 30-core CPU, than from Intel's sales pitch or some Java toy.

Beware of demo problems for which the resources wasted (number of cores and memory) are much bigger than the speedup achieved. If it takes 4 cores and 4x the RAM to solve something 2x faster, the solution is not scalable for parallelization.

ZXX