I have a D2 program that, in its current form, is single-threaded and calls the same pure function about 10 to 100 times in an inner loop for each iteration of the program's outer loop. There is no data dependency among the calls, i.e. no call uses the result of any other call. Overall, this function is called millions of times, and is the main bottleneck in my program. The parameters are unique almost every time, so caching wouldn't help.

At first glance, this seems like the perfect candidate for parallelization. The only problem is that the function takes about 3 microseconds per call, well below the latency of creating a new thread, and not far above the overhead of adding a job to a task pool (i.e., acquiring a mutex, allocating memory to hold information about the task, dealing with possible contention on the task pool's queue, etc.). Is there any good way to take advantage of parallelism this fine-grained?

+3  A: 

What about creating multiple threads, each with its own queue to work from? Since the queues don't overlap, you shouldn't need to create locks.

Georg
The main thread still has to add tasks to the separate queues, so you still need a lock.
sth
You can use a lock-free singly linked list implementation (e.g. Microsoft's Interlocked*SList).
Crashworks
Chances are high that he doesn't need to push every element into the queue, but can just say: Thread1 does the first 100000 calculations, Thread2 does 100001-200000, and so on.
Georg
+1  A: 

Depending on the structure of your program, you could always combine a group of calls into one task. If each task performs 50 function calls, the task-management overhead isn't such a big factor anymore.
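
In D2 terms this maps nicely onto a parallel foreach with an explicit work-unit size (assuming a Phobos recent enough to ship std.parallelism; Thingy and InnerFunction below are stand-ins for your actual types):

import std.parallelism;

struct Thingy { /* your data */ }
void InnerFunction(ref Thingy t) { /* your pure, ~3 microsecond function */ }

void processAll(Thingy[] inputData)
{
    // Each task claims 50 consecutive elements, so the scheduling
    // overhead is amortized over 50 calls to InnerFunction.
    foreach (ref item; taskPool.parallel(inputData, 50))
        InnerFunction(item);
}

The work-unit size (50 here) is the knob: larger units mean less scheduling overhead, smaller units mean better load balancing.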

sth
+2  A: 

Don't spin up a thread to run a single task and then shut it right down.

At the beginning of your program, create a thread for each core that just sits there waiting for data from a queue (a pipe, or some mechanism of your own creation). If you can come up with a mechanism where all the threads wait on the same queue, even better, but then the queue's get method will have to be synchronized...

Whenever you have a block of a few hundred or a few thousand computations to perform, drop the entire block into the next empty queue.

You'll actually end up with one or more threads feeding the queues, a bunch of threads processing data from the queues, and one or more reading and dealing with results.

You may need to put enough data in the "items" you are processing to be able to tell what to do with them after you are done. They should almost certainly be an object, and you may want them to contain state information.

You probably don't want more threads doing processing than you have cores.

Edit: Also look at the concurrency libraries, like Java's ThreadPoolExecutor. It's easy to forget about the concurrency library (like I just did); it's probably exactly what you were looking for (hence the emphasis).
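
A message-passing sketch of the persistent-worker idea in D2, using std.concurrency (the element type, the round-robin dispatch, and InnerFunction are all placeholder choices, not prescriptions):

import std.concurrency;
import std.parallelism : totalCPUs;

void InnerFunction(double x) { /* your pure function */ }

// A long-lived worker: it blocks on its private message queue and
// processes a whole block of inputs per message.
void worker()
{
    bool done = false;
    while (!done)
    {
        receive(
            (immutable(double)[] block) {
                foreach (x; block)
                    InnerFunction(x);
            },
            (string stop) { done = true; }
        );
    }
}

void dispatch(immutable(double)[][] blocks)
{
    Tid[] workers;
    foreach (i; 0 .. totalCPUs)
        workers ~= spawn(&worker);
    foreach (n, block; blocks)
        send(workers[n % workers.length], block);  // round-robin
    foreach (w; workers)
        send(w, "stop");
}

The blocks are immutable so they can legally cross thread boundaries under D's shared/immutable rules.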

Bill K
+1  A: 

This sounds like something where SIMD instructions can help. If you've got an auto-vectorizing compiler, you should be able to rewrite the function to operate on 4 values simultaneously, and the compiler can condense that into the appropriate SSE instructions. This could help cut down on the function call overhead. If your compiler isn't good at auto-vectorizing code, then you may be able to use SSE intrinsics to get almost down to assembly level to program the body of the function.
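
As a sketch of the "four values at a time" idea in D itself (scale is a made-up stand-in for the real function body): rewriting the scalar routine to work on whole slices with D's array-wise operations hands the compiler a straight-line loop it can map to SSE:

// Scalar version: one call per data point.
float scale(float x, float k) { return x * k + 1.0f; }

// Batched version: the array-wise expression compiles to a tight
// loop over the slice, which a vectorizing backend can turn into
// SSE instructions (4 floats per register).
void scaleBatch(const(float)[] xs, float k, float[] result)
{
    result[] = xs[] * k + 1.0f;
}

One call now covers a whole batch, which also amortizes the call overhead the question worries about.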

With current compilers it's really much better to write the SIMD code out as intrinsics yourself rather than leaving it up to the vectorizer. Yes, in theory modern compilers *should* be able to vectorize code properly on their own, but in practice they *don't*.
Crashworks
+2  A: 

As suggested above, don't kick off a thread every time you enter this function; moreover, make the "job" granularity larger than one call of the inner function, so that the job-creation overhead is well amortized. Describing your original routine as something like:

void OuterFunction( Thingy inputData[], int N )
{
  for ( int i = 0 ; i < N ; ++i )
    InnerFunction( inputData[i] );
}

We'd solve your problem like this (assuming a job queue system is present):

void JobFunc( Thingy inputData[], int start, int stop )
{
  for ( int i = start ; i < stop ; ++i )
    InnerFunction( inputData[i] );
}

void OuterFunction( Thingy inputData[], int N, int numCores )
{
   int perCore = N / numCores; // assuming N % numCores == 0
                               // (omitting the edge case for clarity)
   for ( int c = 0 ; c < numCores ; ++c )
     QueueJob( JobFunc, inputData, c * perCore, (c + 1) * perCore );
}

So long as your input data is completely independent, as you say in your original question, you needn't lock it; synchronization is only necessary when there is dependency between the threads and here there is none.

Also, at this level of performance microoptimizations start to become relevant: most importantly, cache locality. Prefetching can get you a surprisingly long way.

Then consider the possibility of SIMD: you can vectorize the function to run four input points through a single register simultaneously. With four cores and 4-wide SIMD you could theoretically get a 16x speedup, but this assumes that the work InnerFunction does is mostly straight-line math, since branching tends to obliterate SSE/VMX performance gains.

Crashworks
+2  A: 

What a fun question... as you've noted, you won't be able to afford the overhead of traditional locking on a work queue for this. I'd encourage you to try one of the existing fine-grained, task-based programming environments if you can. I think about this in three buckets of work:

The first chunk of the problem is to ensure safety, correctness and parallelizability, and it sounds like you have that covered because your function is pure.

I think the next most challenging portion is describing the concurrency; specifically, you mention this function is called many, many times. Can you pipeline this and separate scheduling the function from its work? If you can't pipeline it, does it look like a parallel loop, a tree traversal, or something more unstructured? Specifically, obeying Amdahl's law: if you can't overlap the work and ensure that several instances of it (or of something else) run at the same time, you're effectively serial even though your function is pure. Anything you can do to refactor the work into a pipeline, a recursive tree traversal (or parallel loop), or, if you must, more unstructured work with explicit dependencies between tasks will help here regardless of the library used.
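
To make the Amdahl point concrete: if a fraction P of the runtime can be overlapped across n cores, the overall speedup is bounded by

speedup(n) = 1 / ((1 - P) + P / n)

so with P = 0.9 and 4 cores the best case is about 3.1x, and no number of cores can push it past 10x.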

The last area I think about is ensuring efficient execution on your platform. This involves reducing overhead and contention in both your code and the scheduling code, and making any serial code absolutely as efficient as possible. If you can't use one of the existing libraries and must build your own, I'd encourage you to look at work-stealing queues and self-guided scheduling algorithms. As you've noted, you won't see gains from traditional locks because their cost outweighs your function's cost, so you'll most likely need lock-free techniques to reduce the cost of adding and removing a task on whatever queue you use. You'll also need to pay a lot of attention to sharing and contention, both within your scheduling algorithm and within your function: at this level of granularity, in addition to the usual branch-misprediction and instruction-throughput issues, shared state and contention even on reads can become sources of contention too.
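
A minimal sketch of the self-guided-scheduling idea in D2, built on core.atomic (total, nThreads and InnerFunction are placeholders): each idle worker atomically claims a chunk whose size shrinks as the remaining work shrinks, so no lock is taken per item.

import core.atomic;
import std.algorithm : min;

void InnerFunction(size_t i) { /* your pure function */ }

shared size_t next;  // index of the first unclaimed item

void workerLoop(size_t total, size_t nThreads)
{
    while (true)
    {
        immutable start = atomicLoad(next);
        if (start >= total) break;
        // Guided self-scheduling: claim remaining/nThreads items,
        // but always at least one.
        immutable chunk = (total - start) / nThreads + 1;
        // Try to claim [start, start + chunk); if another worker
        // raced us, retry with the fresh value of next.
        if (!cas(&next, start, start + chunk)) continue;
        foreach (i; start .. min(start + chunk, total))
            InnerFunction(i);
    }
}

Run one workerLoop per core; the shrinking chunk size gives big, cheap grabs early on and fine-grained balancing at the tail.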

I'm sorry if this wasn't super specific, but I hope it was useful.

Rick
A: 

You might be able to turn the loop inside out, using compare-and-swap to get an atomic, lock-free increment:

void OuterFunction()
{
  for(int i = 0; i < N; i++)
    InnerFunction(i);
}

goes to:

import core.atomic;
import core.thread;

void OuterFunction()
{
   shared int i = 0, j = 0;  // i: next index to claim; j: completed count

   void Go()
   {
      int k;
      // atomicOp!"+=" returns the updated value, so subtracting 1
      // yields the index this thread just claimed -- an atomic,
      // lock-free fetch-and-increment.
      while ((k = atomicOp!"+="(i, 1) - 1) < N)
      {
         InnerFunction(k);

         atomicOp!"+="(j, 1); // count completed calls
      }
   }

   foreach (t; 0 .. ThreadCount - 1)
      new Thread(&Go).start();

   Go(); // join in

   while (atomicLoad(j) < N)
      Thread.yield(); // let everyone else catch up
}


Edit: my threading is rusty, so double-check the names against core.thread and core.atomic before trusting this.

BCS
A: 

There is no data dependency among the calls, i.e. no call uses the result from any other call.

That will help with parallelisation, but be absolutely certain that the function has no side effects at all. If the function updates a data structure, is it thread-safe? If it does IO, will the IO just become the bottleneck once you manage to parallelise execution of the function?
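
In D2 specifically you can get the compiler to enforce this: mark the function pure and it will reject any access to mutable global state or any call to an impure function (the signature below is a made-up example, not the asker's actual routine):

// `pure` guarantees the result depends only on the arguments;
// the compiler rejects reads/writes of mutable globals and IO,
// so concurrent calls cannot interfere with each other.
double InnerFunction(double x, double y) pure
{
    return x * x + y * y;
}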

If the answers to those questions are satisfactory, then the previous suggestions are fine; just try to maximise the granularity of the app by assigning as many executions of the function as possible to each thread.

Still, you probably won't get any benefits from massive parallelism, but maybe some more modest speedup can be had...

alatkins