370 views · 8 answers

I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.

for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++)
        histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question

It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?

Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered

  • just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
  • trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically

What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?

By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it)
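To make the second option concrete, here is a minimal sketch of the dynamic approach with POSIX threads. The values of N, NTHREADS and HIST_SIZE, and the trivial bin_index, are placeholders standing in for my real program's definitions:

```c
/* Sketch of the "dynamic" option: a mutex-protected counter hands out
   values of i1 as tasks. N, NTHREADS, HIST_SIZE and this trivial
   bin_index are placeholders for the real program's definitions. */
#include <pthread.h>

#define N 20
#define NTHREADS 2
#define HIST_SIZE 100

static int histogram[HIST_SIZE];
static int next_i1 = 0;                        /* the next task to hand out */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* grab the next value of i1; the mutex makes this race-free */
        pthread_mutex_lock(&task_lock);
        int i1 = next_i1++;
        pthread_mutex_unlock(&task_lock);
        if (i1 >= N)
            return NULL;

        for (int i2 = 0; i2 < N; i2++)
            for (int i3 = 0; i3 < N; i3++)
                for (int i4 = 0; i4 < N; i4++) {
                    int b = bin_index(i1, i2, i3, i4);
                    pthread_mutex_lock(&hist_lock);
                    histogram[b] += 1;         /* shared update, guarded */
                    pthread_mutex_unlock(&hist_lock);
                }
    }
}

void run_dynamic(void)
{
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

The same counter scheme would extend to finer-grained tasks (say, (i1, i2) pairs) if the per-i1 workload turned out to be uneven.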

EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:

int bin_index(int i1, int i2, int i3, int i4) {
    // w, d, h are constant floats
    float x1 = i1 * w / N,  x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
    float l = sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
    float th = acos(h / l);
    // th_max is a constant float (previously computed as a function of w, d, h)
    return (int)(th / th_max);
}

(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)

+2  A: 

The first approach is enough. No need for complication here. If you start playing with mutexes you risk introducing hard-to-detect errors.

Don't start complicating unless you really see that you need to. Synchronization issues (especially with many threads rather than many processes) can be really painful.

sharptooth
I don't think this would work, as two threads could be updating the same histogram element at the same time. Similarly, bin_index(i1, i2, i3, i4) could either access histogram or have other side effects.
Shane MacLaughlin
Whether or not they will try to do it depends on how bin_index() works.
sharptooth
Yes, it depends on how histogram works
Joe Soul-bringer
A: 

If you ever do it in .NET, use the Parallel Extensions.

bzlm
+1  A: 

As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.

You could probably also use Intel's Threading Building Blocks library.

Adrian Grigore
Thanks for the link(s), I will have to take a look at that.
David Zaslavsky
A: 

If you want to write multithreaded number crunching code (and you are going to be doing a lot of it in the future) I would suggest you take a look at using a functional language like OCaml or Haskell.

Due to the lack of side effects and lack of shared state in functional languages (well, mostly) making your code run across multiple threads is a LOT easier. Plus, you'll probably find that you end up with a lot less code.

Dan Fish
Sounds like a great excuse to learn Haskell ;-) How does the speed of something like Haskell or OCaml compare to C?
David Zaslavsky
+2  A: 

The first approach is simple. It is also sufficient if you expect the load to be balanced evenly over the threads. In some cases, especially if the complexity of bin_index is very dependent on the parameter values, one thread could end up with a much heavier task than the rest. Remember: the job is only finished when the last thread finishes.

The second approach is a bit more complicated, but balances the load more evenly if the tasks are fine-grained enough (i.e. the number of tasks is much larger than the number of threads).

Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.

Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.
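As an aside, the thread-local histogram idea can be sketched in C with POSIX threads like this. N, NTHREADS, HIST_SIZE and this stand-in bin_index are assumptions, not the question's real values:

```c
/* Each thread fills a private histogram for its share of the i1 range,
   then merges it into the shared one under a single lock.
   N, NTHREADS, HIST_SIZE and bin_index here are placeholders. */
#include <pthread.h>
#include <string.h>

#define N 20
#define NTHREADS 2
#define HIST_SIZE 100

static int histogram[HIST_SIZE];
static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

static void *worker(void *arg)
{
    int t = *(int *)arg;
    int local[HIST_SIZE];
    memset(local, 0, sizeof local);

    /* static split: thread t handles i1 = t, t+NTHREADS, t+2*NTHREADS, ... */
    for (int i1 = t; i1 < N; i1 += NTHREADS)
        for (int i2 = 0; i2 < N; i2++)
            for (int i3 = 0; i3 < N; i3++)
                for (int i4 = 0; i4 < N; i4++)
                    local[bin_index(i1, i2, i3, i4)] += 1;

    /* one locked merge per thread instead of one lock per increment */
    pthread_mutex_lock(&hist_lock);
    for (int i = 0; i < HIST_SIZE; i++)
        histogram[i] += local[i];
    pthread_mutex_unlock(&hist_lock);
    return NULL;
}

void run_static_split(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

With the ~10000 bins mentioned in the comments, the per-thread copies cost only a few tens of kilobytes each, so the extra memory is negligible.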

Renze de Waal
+1 for the '"histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread' paragraph.
Shane MacLaughlin
As an alternative to combining thread-local histograms, you could also theoretically have an array of locks or mutexes the same size as the histogram array to avoid unnecessary blocking. This would be a bit more memory-efficient for lots of threads.
Shane MacLaughlin
I do not agree for the histogram stuff. If you write (or read) at different indexes of the array, there is no problem, which seems to be the case here. The interrupt problem is not a problem here.
Jérôme
@Jerome - What you are saying is that "histogram[bin_index(i1, i2, i3, i4)] += 1" boils down to an atomic operation. This may not be the case, depending on the type of histogram and the side effects of bin_index(i1, i2, i3, i4). You're changing the contents of an array, where type isn't specified.
Shane MacLaughlin
@smacl: in this case histogram has length ~10000, that's a lot of mutexes ;-) interesting idea though.
David Zaslavsky
@Jerome - There is no guarantee whatsoever that you are writing in different indexes of the array. With the available knowledge, bin_index(i1, i2, i3, i4) can very well have the same result for different values of i1.
Renze de Waal
@smacl - A nice idea to have more fine-grained locking. Definitely worth pursuing.
Renze de Waal
@Renze: Actually it could. I misunderstood the bin_index function. Then a mutex is necessary for accessing the array. @smacl: I had the idea of a simple C array.
Jérôme
histogram[bin_index(i1, i2, i3, i4)] += 1... isn't that atomic? I thought only things like "if (x != NULL) x.Foo();" aren't atomic
FryGuy
It's not atomic, since it consists of at least 3 instructions: fetch, increment, store. Nothing is atomic, really, unless it's explicitly guaranteed... you should check out the output of gcc -S sometime if you want to understand in more detail.
David Zaslavsky
I thought it would be something like: "push [i1]; ... push [i4]; call bin_index; pop ax; add ax, histogram; add [ax], 1". It's been a long time since I've done asm, so it's probably slightly wrong. At the microcode level, it's not atomic, but I thought with copy-on-write and dirty bits this was ok
FryGuy
+2  A: 

If you have never coded a multithreaded application, I would urge you to begin with OpenMP:

  • the library is now included in gcc by default
  • it is very easy to use

In your example, you should just have to add a couple of pragmas:

#pragma omp parallel for
for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++) {
        #pragma omp atomic
        histogram[bin_index(i1, i2, i3, i4)] += 1;
      }

With this in place, the compiler adds the instructions to create the threads, launch them, split the iterations of the outer loop among them, and synchronize the updates to the shared histogram. There are a lot of options, but well-chosen pragmas do all the work for you. Basically, how simple it stays depends on the data dependencies.

Of course, the result will not be as optimal as if you had coded it all by hand, but if you don't have a load-balancing problem you could approach a 2x speedup. After all, this is only writing into an array with no spatial dependency between iterations.
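Locking or atomically updating the shared histogram on every increment can become a bottleneck, and the per-thread histogram idea from the other answers maps directly onto OpenMP too. A sketch, where N, HIST_SIZE and this bin_index are placeholders; note that without -fopenmp the pragmas are simply ignored and the code runs serially with the same result:

```c
/* Sketch: per-thread local histograms with OpenMP. Each thread fills its
   own array, and the copies are merged once inside a critical section.
   N, HIST_SIZE and this bin_index are placeholders for the real program.
   Compile with gcc -fopenmp; without it the pragmas are ignored. */
#include <string.h>

#define N 20
#define HIST_SIZE 100

static int histogram[HIST_SIZE];

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

void fill_histogram(void)
{
    #pragma omp parallel
    {
        int local[HIST_SIZE];
        memset(local, 0, sizeof local);

        /* the iterations of the outer loop are split among the threads */
        #pragma omp for
        for (int i1 = 0; i1 < N; i1++)
            for (int i2 = 0; i2 < N; i2++)
                for (int i3 = 0; i3 < N; i3++)
                    for (int i4 = 0; i4 < N; i4++)
                        local[bin_index(i1, i2, i3, i4)] += 1;

        /* one merge per thread instead of one synchronization per increment */
        #pragma omp critical
        for (int i = 0; i < HIST_SIZE; i++)
            histogram[i] += local[i];
    }
}
```

This trades a little memory per thread for far less contention on the shared array.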

Jérôme
A: 

I agree with Sharptooth that your first approach seems like the only plausible one.

Your single-threaded app is continuously writing to memory. To get any speedup, your several threads would also need to be continuously writing to memory. If only one thread is writing at a time, you get no speedup at all. So if your writes are guarded, the whole exercise fails.

This is a dangerous approach, since you are writing to shared memory without a guard. But it seems to be worth the danger (if a 2x speedup matters). If you can be sure that all the values of bin_index(i1, i2, i3, i4) are different in your division of the loop, then it should work, since the array writes would go to different locations in shared memory. Still, one should always look long and hard at approaches like this.

I assume you would also produce a test routine to compare the results of the two versions.

Edit:

Looking at your bin_index(i1, i2, i3, i4), I suspect your process could not be parallelized without considerable effort.

The only way to divide up the work of the calculation in your loop is, again, to be sure that your threads will not write to the same areas in memory. However, it looks like bin_index(i1, i2, i3, i4) will repeat values quite often. You might divide the iteration into the conditions where bin_index is higher than a cutoff and those where it is lower. Or you could divide it arbitrarily and make sure the increment is implemented atomically. But any complex threading approach looks unlikely to provide an improvement when you only have two cores to work with in the first place.
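On the "increment implemented atomically" point: a plain += is not atomic, but gcc (4.1 and later) provides atomic builtins that make the increment safe without a mutex. A sketch, where HIST_SIZE and this bin_index are placeholders:

```c
/* Sketch: an atomic increment via the gcc builtin __sync_fetch_and_add,
   so two threads hitting the same bin cannot lose an update.
   HIST_SIZE and this bin_index are placeholders for the real program. */
#define HIST_SIZE 100

static int histogram[HIST_SIZE];

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

void count_point(int i1, int i2, int i3, int i4)
{
    /* atomic read-modify-write on the selected bin */
    __sync_fetch_and_add(&histogram[bin_index(i1, i2, i3, i4)], 1);
}
```

On x86 this typically becomes a single lock-prefixed add, so it is still per-increment overhead, just much cheaper than taking a mutex.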

Joe Soul-bringer
+1  A: 

I would do something like this:

void HistogramThread(int i1, Action<int[]> HandleResults)
{
    int[] histogram = new int[HistogramSize];

    for (int i2 = 0; i2 < N; i2++)
       for (int i3 = 0; i3 < N; i3++)
          for (int i4 = 0; i4 < N; i4++)
             histogram[bin_index(i1, i2, i3, i4)] += 1;

    HandleResults(histogram);
}

int[] CalculateHistogram()
{
    int[] histogram = new int[HistogramSize];

    ThreadPool pool; // I don't know syntax off the top of my head
    for (int i1=0; i1<N; i1++)
    {
       pool.AddNewThread(HistogramThread, i1, delegate(int[] h)
       {
           lock (histogram)
           {
               for (int i=0; i<HistogramSize; i++)
                   histogram[i] += h[i];
           }
       });
    }
    pool.WaitForAllThreadsToFinish();

    return histogram;
}

This way you don't need to share any memory, until the end.

FryGuy
+1 - that's pretty similar to what I actually wound up doing ;-)
David Zaslavsky