views: 239
answers: 5

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems that when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set to high nice values) only manage to get ~1800% CPU (using Linux notation), while about 500% of the CPU (again, using Linux notation) sits idle. Can anyone explain this behavior, and what can I do about it to get all 23 of the cores that aren't being used by someone else?

Notes:

  1. In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.

  2. The CPU architecture is x64. Is it at all possible that it matters that my 24-thread jobs are 32-bit while the other jobs I'm competing with are 64-bit?

Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.

A: 

It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.

As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
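
For example, here is a minimal sketch of that experiment, assuming a Linux/glibc system and using the GNU extension pthread_setaffinity_np (the file name, the spin-loop worker, and the thread count are placeholders, not your actual job):

// COMPILE WITH: gcc affinity.c -lpthread -o affinity
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define NUM_THREADS 24

// Placeholder for the real computation: just keeps one core busy.
void* worker(void* arg) {
    volatile unsigned long x = 0;
    (void)arg; /* unused */
    while (1) x++;
    return NULL; /* never reached */
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    cpu_set_t set;
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker, NULL);

        // Pin thread i to CPU i so the kernel scheduler can't move it.
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(threads[i], sizeof(set), &set);
    }

    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

If mpstat then shows one of the pinned CPUs being shared with the interfering job while others sit idle, that points at the scheduler's placement decisions rather than at your application.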

Eric Seppanen
Unfortunately I don't have admin privileges and sysstat isn't installed.
dsimcha
It's not hard to build sysstat from source.
Eric Seppanen
+1  A: 

Do your threads have to synchronize? If so, you might have the following problem:

Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).

If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.

Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
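
If you want to check whether this effect applies to your job, a toy experiment is to run a barrier-synchronized program next to a single-threaded CPU hog and watch its total utilization. Here is a minimal sketch (the file name, thread count, iteration count, and busy-loop work are arbitrary placeholders):

// COMPILE WITH: gcc barrier_demo.c -lpthread -o barrier_demo
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define ITERATIONS  1000

static pthread_barrier_t barrier;

// Placeholder for one chunk of real computation.
static void do_work(void) {
    volatile unsigned long x = 0;
    unsigned long i;
    for (i = 0; i < 50000000UL; i++) x++;
}

static void* worker(void* arg) {
    int iter;
    (void)arg; /* unused */
    for (iter = 0; iter < ITERATIONS; iter++) {
        do_work();
        // Everyone waits here: if one thread shares a core with an
        // interfering job, the others idle at this barrier until it catches up.
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int i;

    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&barrier);

    return 0;
}

Run it by itself and then alongside a single spinning process; if the total utilization drops well below NUM_THREADS x 100% in the second case, the slowest-thread effect described above is what you're seeing.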

If that's the problem, you can certainly play with priorities to give the unwelcome tasks only tiny fractions of the available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try increasing the number of threads so that each CPU has, say, 2 of your threads, minus a few to allow for system tasks. On that basis a 24-core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).

Eric Seppanen
Of course, caf's suggestion of 23 threads is probably better than my suggestion of 46 threads as a way of getting 2300% utilization.
Eric Seppanen
+3  A: 

It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").

Here's a few ideas you can try:

  • Run twice as many threads as you have cores;
  • Run one or two fewer threads than you have cores;
  • Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
  • Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
caf
A: 

Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.

// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>
#include <stddef.h>

#define NUM_CORES 24

// Each thread spins forever so it keeps one core fully busy.
// 'a' is volatile so the loop isn't optimized away.
void* loop_forever(void* argument) {
    volatile int a = 0;
    (void)argument; /* unused */
    while (1) a++;
    return NULL; /* never reached */
}

int main(void) {
    int i;
    pthread_t threads[NUM_CORES];

    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], NULL, loop_forever, NULL);

    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
Tim Kryger
A: 

Do your threads communicate with each other?

Try manually binding every thread to a CPU with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.

osgx