370 views · 8 answers

I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.

for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++)
        histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question

It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?

Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered

  • just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
  • trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically

What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?

By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it)
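To make the second option concrete, here is a minimal sketch of the dynamic approach with POSIX threads. The values of N, NTHREADS and HIST_SIZE, and the trivial bin_index, are placeholders standing in for my real program's definitions:

```c
/* Sketch of the "dynamic" option: a mutex-protected counter hands out
   values of i1 as tasks. N, NTHREADS, HIST_SIZE and this trivial
   bin_index are placeholders for the real program's definitions. */
#include <pthread.h>

#define N 20
#define NTHREADS 2
#define HIST_SIZE 100

static int histogram[HIST_SIZE];
static int next_i1 = 0;                        /* the next task to hand out */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* grab the next value of i1; the mutex makes this race-free */
        pthread_mutex_lock(&task_lock);
        int i1 = next_i1++;
        pthread_mutex_unlock(&task_lock);
        if (i1 >= N)
            return NULL;

        for (int i2 = 0; i2 < N; i2++)
            for (int i3 = 0; i3 < N; i3++)
                for (int i4 = 0; i4 < N; i4++) {
                    int b = bin_index(i1, i2, i3, i4);
                    pthread_mutex_lock(&hist_lock);
                    histogram[b] += 1;         /* shared update, guarded */
                    pthread_mutex_unlock(&hist_lock);
                }
    }
}

void run_dynamic(void)
{
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

The same counter scheme would extend to finer-grained tasks (say, (i1, i2) pairs) if the per-i1 workload turned out to be uneven.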

EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:

int bin_index(int i1, int i2, int i3, int i4) {
    // w, d, h are constant floats
    float x1 = i1 * w / N,  x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
    float l = sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
    float th = acos(h / l);
    // th_max is a constant float (previously computed as a function of w, d, h)
    return (int)(th / th_max);
}

(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)

+2  A: 

The first approach is enough. No need for complication here. If you start playing with mutexes you risk introducing hard-to-detect errors.

Don't start complicating unless you really see that you need to. Synchronization issues (especially with many threads rather than many processes) can be really painful.

sharptooth
I don't think this would work, as two threads could be updating the same histogram element at the same time. Similarly, bin_index(i1, i2, i3, i4) could either access histogram or have other side effects.
Shane MacLaughlin
Whether or not they will try to do it depends on how bin_index() works.
sharptooth
Yes, it depends on how histogram works
Joe Soul-bringer
A: 

If you ever do it in .NET, use the Parallel Extensions.

bzlm
+1  A: 

As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.

You could probably also use Intel's Threading Building Blocks library.

Adrian Grigore
Thanks for the link(s), I will have to take a look at that.
David Zaslavsky
A: 

If you want to write multithreaded number crunching code (and you are going to be doing a lot of it in the future) I would suggest you take a look at using a functional language like OCaml or Haskell.

Due to the lack of side effects and lack of shared state in functional languages (well, mostly) making your code run across multiple threads is a LOT easier. Plus, you'll probably find that you end up with a lot less code.

Dan Fish
Sounds like a great excuse to learn Haskell ;-) How does the speed of something like Haskell or OCaml compare to C?
David Zaslavsky
+2  A: 

The first approach is simple. It is also sufficient if you expect the load to be balanced evenly over the threads. In some cases, especially if the complexity of bin_index is very dependent on the parameter values, one thread could end up with a much heavier task than the rest. Remember: the job is only finished when the last thread finishes.

The second approach is a bit more complicated, but balances the load more evenly if the tasks are fine-grained enough (i.e. the number of tasks is much larger than the number of threads).

Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.

Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.
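As an aside, the thread-local histogram idea can be sketched in C with POSIX threads like this. N, NTHREADS, HIST_SIZE and this stand-in bin_index are assumptions, not the question's real values:

```c
/* Each thread fills a private histogram for its share of the i1 range,
   then merges it into the shared one under a single lock.
   N, NTHREADS, HIST_SIZE and bin_index here are placeholders. */
#include <pthread.h>
#include <string.h>

#define N 20
#define NTHREADS 2
#define HIST_SIZE 100

static int histogram[HIST_SIZE];
static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

static void *worker(void *arg)
{
    int t = *(int *)arg;
    int local[HIST_SIZE];
    memset(local, 0, sizeof local);

    /* static split: thread t handles i1 = t, t+NTHREADS, t+2*NTHREADS, ... */
    for (int i1 = t; i1 < N; i1 += NTHREADS)
        for (int i2 = 0; i2 < N; i2++)
            for (int i3 = 0; i3 < N; i3++)
                for (int i4 = 0; i4 < N; i4++)
                    local[bin_index(i1, i2, i3, i4)] += 1;

    /* one locked merge per thread instead of one lock per increment */
    pthread_mutex_lock(&hist_lock);
    for (int i = 0; i < HIST_SIZE; i++)
        histogram[i] += local[i];
    pthread_mutex_unlock(&hist_lock);
    return NULL;
}

void run_static_split(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

With the ~10000 bins mentioned in the comments, the per-thread copies cost only a few tens of kilobytes each, so the extra memory is negligible.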

Renze de Waal
+1 for the '"histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread' paragraph.
Shane MacLaughlin
As an alternative to combining thread-local histograms, you could also theoretically have an array of locks or mutexes the same size as the histogram array to avoid unnecessary blocking. This would be a bit more memory-efficient for lots of threads.
Shane MacLaughlin
I do not agree for the histogram stuff. If you write (or read) at different indexes of the array, there is no problem, which seems to be the case here. The interrupt problem is not a problem here.
Jérôme
@Jerome - What you are saying is that "histogram[bin_index(i1, i2, i3, i4)] += 1" boils down to an atomic operation. This may not be the case, depending on the type of histogram and the side effects of bin_index(i1, i2, i3, i4). You're changing the contents of an array, where type isn't specified.
Shane MacLaughlin
@smacl: in this case histogram has length ~10000, that's a lot of mutexes ;-) interesting idea though.
David Zaslavsky
@Jerome - There is no guarantee whatsoever that you are writing in different indexes of the array. With the available knowledge, bin_index(i1, i2, i3, i4) can very well have the same result for different values of i1.
Renze de Waal
@smacl - A nice idea to have more fine-grained locking. Definitely worth pursuing.
Renze de Waal
@Renze: Actually it could. I misunderstood the bin_index function. Then a mutex is necessary for accessing the array. @smacl: I had the idea of a simple C array.
Jérôme
histogram[bin_index(i1, i2, i3, i4)] += 1... isn't that atomic? I thought only things like "if (x != NULL) x.Foo();" aren't atomic
FryGuy
It's not atomic, since it consists of at least 3 instructions: fetch, increment, store. Nothing is atomic, really, unless it's explicitly guaranteed... you should check out the output of gcc -S sometime if you want to understand in more detail.
David Zaslavsky
I thought it would be something like: "push [i1]; ... push [i4]; call bin_index; pop ax; add ax, histogram; add [ax], 1". It's been a long time since I've done asm, so it's probably slightly wrong. At the microcode level, it's not atomic, but I thought with copy-on-write and dirty bits this was ok
FryGuy
+2  A: 

If you have never coded a multithreaded application, I would urge you to begin with OpenMP:

  • the library is now included in gcc by default
  • it is very easy to use

In your example, you should just have to add a couple of pragmas:

#pragma omp parallel for
for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++) {
        #pragma omp atomic
        histogram[bin_index(i1, i2, i3, i4)] += 1;
      }

With this in place, the compiler adds the instructions to create the threads, launch them, split the iterations of the outer loop among them, and synchronize the updates to the shared histogram. There are a lot of options, but well-chosen pragmas do all the work for you. Basically, how simple it stays depends on the data dependencies.

Of course, the result will not be as optimal as if you had coded it all by hand, but if you don't have a load-balancing problem you could approach a 2x speedup. After all, this is only writing into an array with no spatial dependency between iterations.
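Locking or atomically updating the shared histogram on every increment can become a bottleneck, and the per-thread histogram idea from the other answers maps directly onto OpenMP too. A sketch, where N, HIST_SIZE and this bin_index are placeholders; note that without -fopenmp the pragmas are simply ignored and the code runs serially with the same result:

```c
/* Sketch: per-thread local histograms with OpenMP. Each thread fills its
   own array, and the copies are merged once inside a critical section.
   N, HIST_SIZE and this bin_index are placeholders for the real program.
   Compile with gcc -fopenmp; without it the pragmas are ignored. */
#include <string.h>

#define N 20
#define HIST_SIZE 100

static int histogram[HIST_SIZE];

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

void fill_histogram(void)
{
    #pragma omp parallel
    {
        int local[HIST_SIZE];
        memset(local, 0, sizeof local);

        /* the iterations of the outer loop are split among the threads */
        #pragma omp for
        for (int i1 = 0; i1 < N; i1++)
            for (int i2 = 0; i2 < N; i2++)
                for (int i3 = 0; i3 < N; i3++)
                    for (int i4 = 0; i4 < N; i4++)
                        local[bin_index(i1, i2, i3, i4)] += 1;

        /* one merge per thread instead of one synchronization per increment */
        #pragma omp critical
        for (int i = 0; i < HIST_SIZE; i++)
            histogram[i] += local[i];
    }
}
```

This trades a little memory per thread for far less contention on the shared array.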

Jérôme
A: 

I agree with Sharptooth that your first approach seems like the only plausible one.

Your single-threaded app is continuously writing to memory. To get any speedup, your several threads would also need to be continuously writing to memory. If only one thread is writing at a time, you get no speedup at all. So if your writes are guarded, the whole exercise fails.

This is a dangerous approach, since you are writing to shared memory without a guard. But it seems to be worth the danger (if a 2x speedup matters). If you can be sure that all the values of bin_index(i1, i2, i3, i4) are different in your division of the loop, then it should work, since the array writes would go to different locations in shared memory. Still, one should always look long and hard at approaches like this.

I assume you would also produce a test routine to compare the results of the two versions.

Edit:

Looking at your bin_index(i1, i2, i3, i4), I suspect your process could not be parallelized without considerable effort.

The only way to divide up the work of the calculation in your loop is, again, to be sure that your threads will not write to the same areas in memory. However, it looks like bin_index(i1, i2, i3, i4) will repeat values quite often. You might divide the iteration into the conditions where bin_index is higher than a cutoff and those where it is lower. Or you could divide it arbitrarily and make sure the increment is implemented atomically. But any complex threading approach looks unlikely to provide an improvement when you only have two cores to work with in the first place.
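On the "increment implemented atomically" point: a plain += is not atomic, but gcc (4.1 and later) provides atomic builtins that make the increment safe without a mutex. A sketch, where HIST_SIZE and this bin_index are placeholders:

```c
/* Sketch: an atomic increment via the gcc builtin __sync_fetch_and_add,
   so two threads hitting the same bin cannot lose an update.
   HIST_SIZE and this bin_index are placeholders for the real program. */
#define HIST_SIZE 100

static int histogram[HIST_SIZE];

static int bin_index(int i1, int i2, int i3, int i4)
{
    return (i1 + i2 + i3 + i4) % HIST_SIZE;    /* placeholder */
}

void count_point(int i1, int i2, int i3, int i4)
{
    /* atomic read-modify-write on the selected bin */
    __sync_fetch_and_add(&histogram[bin_index(i1, i2, i3, i4)], 1);
}
```

On x86 this typically becomes a single lock-prefixed add, so it is still per-increment overhead, just much cheaper than taking a mutex.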

Joe Soul-bringer
+1  A: 

I would do something like this:

void HistogramThread(int i1, Action<int[]> HandleResults)
{
    int[] histogram = new int[HistogramSize];

    for (int i2 = 0; i2 < N; i2++)
       for (int i3 = 0; i3 < N; i3++)
          for (int i4 = 0; i4 < N; i4++)
             histogram[bin_index(i1, i2, i3, i4)] += 1;

    HandleResults(histogram);
}

int[] CalculateHistogram()
{
    int[] histogram = new int[HistogramSize];

    ThreadPool pool; // I don't know syntax off the top of my head
    for (int i1=0; i1<N; i1++)
    {
       pool.AddNewThread(HistogramThread, i1, delegate(int[] h)
       {
           lock (histogram)
           {
               for (int i=0; i<HistogramSize; i++)
                   histogram[i] += h[i];
           }
       });
    }
    pool.WaitForAllThreadsToFinish();

    return histogram;
}

This way you don't need to share any memory, until the end.

FryGuy
+1 - that's pretty similar to what I actually wound up doing ;-)
David Zaslavsky