views:

540

answers:

7

I'm writing a book on multicore programming using .NET 4, and I'm curious to know what parts of multicore programming people have found difficult to grok, or anticipate will be difficult to grok.

+2  A: 

To understand: low-level memory details, like the difference between acquire and release semantics of memory operations.

Most of the rest of the concepts and ideas (anything can interleave, race conditions, ...) are not that difficult with a little usage.

Of course, the practice is very hard, especially when something fails only sometimes, because you need to work at multiple levels of abstraction to understand what is going on. So keep your design simple, and as far as possible design out the need for locking (e.g. by using immutable data and higher-level abstractions).
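To make the acquire/release point concrete: in C#, a write to a volatile field has release semantics and a read has acquire semantics, which is what makes the classic publish-via-flag pattern safe. A minimal sketch (the field names and the value 42 are illustrative):

```csharp
class Publication
{
    int data;             // plain field: the payload being published
    volatile bool ready;  // volatile: writes are releases, reads are acquires

    public void Publish()
    {
        data = 42;        // 1. write the payload
        ready = true;     // 2. release write: the payload write above
                          //    cannot be reordered past this store
    }

    public bool TryRead(out int value)
    {
        if (ready)        // acquire read: if we see true, we are also
        {                 // guaranteed to see the payload written before it
            value = data;
            return true;
        }
        value = 0;
        return false;
    }
}
```

Without the volatile (acquire/release) semantics on `ready`, a reader could in principle observe `ready == true` while still seeing a stale `data`.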

Richard
+2  A: 

It's not so much the theoretical details as the practical implementation details that trip people up.

What's the deal with immutable data structures?

All the time, people try to update a data structure from multiple threads, find it too hard, and someone chimes in "use immutable data structures!", and so our persistent coder writes this:

ImmutableSet<Customer> set = ImmutableSet<Customer>.Empty;

void ThreadLoop1()
{
    foreach (Customer c in dataStore1)
        set = set.Add(ProcessCustomer(c));   // unsynchronized read-modify-write
}

void ThreadLoop2()
{
    foreach (Customer c in dataStore2)
        set = set.Add(ProcessCustomer(c));   // races with ThreadLoop1: updates get lost
}

The coder has heard all their life that immutable data structures can be updated without locking, but the new code doesn't work, for a reason that is obvious only in hindsight: the `set` variable itself is still shared mutable state, so the read-Add-assign sequence in each loop races with the other thread and updates are silently lost.

Even if you're targeting academics and experienced devs, a little primer on the basics of immutable programming idioms can't hurt.
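As a hedged sketch of what working code looks like (using `ImmutableHashSet<T>` from the System.Collections.Immutable package for concreteness, which postdates .NET 4): publish each new version with a compare-and-swap loop, so a lost race is detected and retried instead of silently dropping updates.

```csharp
using System.Collections.Immutable;
using System.Threading;

static class SharedSet
{
    // All threads read and replace this single reference atomically.
    static ImmutableHashSet<int> set = ImmutableHashSet<int>.Empty;

    public static void Add(int item)
    {
        ImmutableHashSet<int> oldSet, newSet;
        do
        {
            oldSet = set;                // snapshot the current version
            newSet = oldSet.Add(item);   // build a new version; oldSet is untouched
        }
        // Install newSet only if no other thread published a different
        // version in the meantime; otherwise retry against the latest one.
        while (Interlocked.CompareExchange(ref set, newSet, oldSet) != oldSet);
    }

    public static int Count() { return set.Count; }
}
```

The immutability is what makes the retry cheap and safe: a failed CAS throws away only the speculative new version, never anyone else's updates.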

How to partition roughly equal amounts of work between threads?

Getting this step right is hard. Sometimes you break up a single process into 10,000 steps which can be executed in parallel, but not all steps take the same amount of time. If you split the work on 4 threads, and the first 3 threads finish in 1 second, and the last thread takes 60 seconds, your multithreaded program isn't much better than the single-threaded version, right?

So how do you partition problems with roughly equal amounts of work between all threads? Lots of good heuristics for solving bin packing problems should be relevant here.
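A common practical alternative to partitioning up front is dynamic (self-scheduling) partitioning: threads grab the next small chunk of work from a shared counter, so threads that finish fast naturally take more chunks and the slow 60-second straggler stops dominating. A minimal sketch (the chunk size of 16 is illustrative and worth tuning):

```csharp
using System;
using System.Threading;

static class SelfScheduling
{
    public static void Run(int[] items, Action<int> process,
                           int threadCount, int chunkSize = 16)
    {
        int next = 0;  // index of the next unclaimed item
        var workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            workers[t] = new Thread(() =>
            {
                while (true)
                {
                    // Claim the next chunk atomically.
                    int start = Interlocked.Add(ref next, chunkSize) - chunkSize;
                    if (start >= items.Length) return;
                    int end = Math.Min(start + chunkSize, items.Length);
                    for (int i = start; i < end; i++)
                        process(items[i]);
                }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();
    }
}
```

Small chunks balance load better but pay the `Interlocked.Add` more often; large chunks amortize that cost but can leave threads idle at the end.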

How many threads?

If your problem is nicely parallelizable, adding more threads should make it faster, right? Well, not really; there is a lot to consider here:

Even on a single-core processor, adding more threads can make a program faster, because more threads give the OS more opportunities to schedule your work, so your program can get more total execution time than the single-threaded version. But by the law of diminishing returns, each added thread increases context switching, so at a certain point, even if your program gets the most execution time, its performance can still be worse than the single-threaded version.

So how do you spin off just enough threads to minimize execution time?

And if there are lots of other apps spinning up threads and competing for resources, how do you detect performance changes and adjust your program automagically?
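There is no universal formula, but a reasonable starting point for CPU-bound work is one thread per core (`Environment.ProcessorCount` in .NET), then measure. A minimal timing sketch, where the per-thread work delegate is a placeholder you would fill with a 1/n slice of your real workload:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ThreadCountProbe
{
    // Run one copy of perThreadWork on each of n threads
    // and time the whole batch.
    public static TimeSpan Measure(int n, Action perThreadWork)
    {
        var threads = new Thread[n];
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            threads[i] = new Thread(() => perThreadWork());
            threads[i].Start();
        }
        foreach (var t in threads)
            t.Join();
        return sw.Elapsed;
    }
}

// Usage: give each thread a 1/n slice of a fixed workload, compare
// Measure(1, ...), Measure(Environment.ProcessorCount, ...), and a few
// values in between, and keep the n that minimizes wall-clock time.
```

Re-running the measurement under realistic load also gives you a crude answer to the "other apps competing for resources" problem: recalibrate when observed throughput drifts.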

Juliet
Jon is a big proponent of functional programming, which really changes which tasks are hard. I guess you missed the "f#" tag on the question; he ought to have pointed out his functional approach. Bin packing might be one approach to partitioning, but work stealing and work queues, where the partitioning isn't determined a priori, seem to be a lot more popular in the real world.
Ben Voigt
@Ben: Note that I advocate impure functional languages like F# and do not shun mutation and mutable data structures at all. Indeed, I believe mutation is often essential in the context of parallelism.
Jon Harrop
+4  A: 

Since you're writing a whole book on multi-core programming in .NET, I think you could also go a little beyond multi-core.

For example, you could include a chapter on parallel computing in distributed systems in .NET. Unfortunately, there are no mature frameworks in .NET yet; DryadLINQ is the closest. (On the other side, Hadoop and its friends on the Java platform are really good.)

You could also include a chapter demonstrating some GPU computing.

Yin Zhu
+4  A: 

I guess some of it depends on how basic or advanced the book/audience is. When you go from single-threaded to multi-threaded programming for the first time, you typically fall off a huge cliff (and many never recover, see e.g. all the muddled questions about Control.Invoke).

Anyway, to add some thoughts that are less about the programming itself, and more about the other related tasks in the software process:

  • Measuring: deciding what metric you are aiming to improve, measuring it correctly (it is so easy to accidentally measure the wrong thing), using the right tools, differentiating signal versus noise, interpreting the results and understanding why they are as they are.

  • Testing: how to write tests that tolerate unimportant non-determinism/interleavings, but still pin down correct program behavior.

  • Debugging: tools, strategies, when "hard to debug" implies feedback to improve your code/design and better partition mutable state, etc.

  • Physical versus logical thread affinity: understanding the GUI thread, understanding how e.g. an F# MailboxProcessor/agent can encapsulate mutable state and run on multiple threads but always with only a single logical thread (one program counter).

  • Patterns (and when they apply): fork-join, map-reduce, producer-consumer, ...
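As one illustration of the last bullet, a minimal producer-consumer sketch using .NET 4's `BlockingCollection<T>` (the int items and the counting "work" are placeholders for real payloads and processing):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ProducerConsumer
{
    // Produce itemCount integers on one task and consume them on another;
    // returns how many items the consumer saw.
    public static int Run(int itemCount)
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 100);
        int consumed = 0;

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < itemCount; i++)
                queue.Add(i);            // blocks while the queue is full
            queue.CompleteAdding();      // signal: no more items are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            // Blocks waiting for items; the loop ends cleanly once
            // CompleteAdding has been called and the queue drains.
            foreach (int item in queue.GetConsumingEnumerable())
                consumed++;              // stand-in for real per-item work
        });

        Task.WaitAll(producer, consumer);
        return consumed;
    }
}
```

The bounded capacity is the important design choice: it applies back-pressure to a fast producer instead of letting the queue grow without limit.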

I expect that there will be a large audience for e.g. "help, I've got a single-threaded app with 12% CPU utilization, and I want to learn just enough to make it go 4x faster without much work" and a smaller audience for e.g. "my app is scaling sub-linearly as we add cores because there seems to be contention here, is there a better approach to use?", and so a bit of the challenge may be serving each of those audiences.

Brian
With regard to testing, my company has used Microsoft's CHESS tool (http://research.microsoft.com/en-us/projects/chess/) to exhaustively test all interleavings of our multithreaded code, and it's been pure awesome from start to finish. It could probably be useful here.
Juliet
+5  A: 

What's a useful unit of work to parallelize, and how do I find/organize one?

All these parallelism primitives aren't helpful if you fork a piece of work that is smaller than the forking overhead; in fact, that buys you a nice slowdown instead of what you are expecting.

So one of the big problems is finding units of work that are obviously more expensive than the parallelism primitives. A key problem here is that nobody knows what anything costs to execute, including the parallelism primitives themselves. Clearly, calibrating these costs would be very helpful. (As an aside, we designed, implemented, and daily use a parallel programming language, PARLANSE, whose objective was to minimize the cost of the parallelism primitives by allowing the compiler to generate and optimize them, with the goal of making smaller bits of work "more parallelizable".)

One might also consider discussing big-O notation and its applications. We all hope that the parallelism primitives have cost O(1). If that's the case, then work with cost O(x) > O(1) is a good candidate for parallelization. If your proposed work is also O(1), then whether parallelizing it is effective depends on the constant factors, and we are back to calibration as above.

There's the problem of collecting work into large enough units, if none of the pieces are large enough. Code motion, algorithm replacement, ... are all useful ideas to achieve this effect.
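One standard way to keep units of work above the overhead threshold is a sequential cutoff in a divide-and-conquer computation: fork only while the halves are big enough to pay for the fork. Sketched here with .NET 4 tasks; the Cutoff constant is illustrative and should be calibrated on the target machine, which is exactly the calibration problem described above.

```csharp
using System.Threading.Tasks;

static class ParallelSum
{
    const int Cutoff = 4096;  // below this, forking costs more than it saves

    public static long Sum(int[] a, int lo, int hi)
    {
        if (hi - lo <= Cutoff)
        {
            long s = 0;                       // small range: plain sequential loop
            for (int i = lo; i < hi; i++)
                s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        var left = Task.Factory.StartNew(() => Sum(a, lo, mid)); // fork the big half
        long right = Sum(a, mid, hi);         // recurse on the current thread
        return left.Result + right;           // Result blocks until the fork finishes
    }
}
```

Raising the cutoff is the "collecting work into large enough units" move in its simplest form; code motion and algorithm replacement come in when the natural units are too small to batch this way.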

Lastly, there's the problem of synchronization: when do my parallel units have to interact, what primitives should I use, and how much do those primitives cost? (More than you expect!)

Ira Baxter
+4  A: 

One thing that has tripped me up is which approach to use to solve a particular type of problem. There are agents, tasks, async computations, and MPI for distribution; for many problems you could use several of these, but I'm having difficulty understanding why I should pick one over another.

Bohdan Szymanik
+1  A: 

I find it very hard to visualize and program synchronized data moving across worker nodes in complex patterns.

Usually I find debugging to be a bear, also.

Paul Nathan