I have my own multithreaded C program whose speed scales smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get near-linear speedup, up to about 5.5x on a 6-core CPU on an Ubuntu Linux box.

I had an opportunity to run the program on a very high-end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads. But it runs at the same speed as just TWO threads!

Much hair-pulling and debugging later, I see that my program really is creating all the threads, and they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, and 16 threads all run at a net speedup of just 1.9x! I can see all the threads are running (not stalled or sleeping); they're just slow.

To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).

So, my question is whether there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.

Clues are:

  1. My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a single CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads. 6 threads is effectively 6x speedup.
  2. My program never runs faster than 2x speedup on this 16 core Sunfire Xeon box, for any number of threads from 2-16.
  3. Running 16 copies of my program single threaded runs perfectly, all 16 running at once at full speed.
  4. top shows 1600% of CPUs allocated. /proc/cpuinfo shows all 16 cores running at full 2.9GHz speed (not low frequency idle speed of 1.6GHz)
  5. There's 48GB of RAM free, it is not swapping.

What's happening? Is there some process CPU limit policy? How could I measure it if so? What else could explain this behavior?

Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!

A: 

My initial guess would be shared-memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels in both tests.

It's unlikely that the motherboard tops out at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (And how does that answer compare to the old machine?) You're not, by chance, running in a virtualized environment?

Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.

Matt Simmons
Most of those points are EXCELLENT and are what I looked at first. But the fact that 16 individual copies run at full speed means it's not a CPU, memory, hyperthreading, or virtualization issue. I'm now convinced it's not a programming issue, but something OS related, but I don't know where to look. Installing a new OS would obviously be a great test, except this is on a $22,000 SunFire server which I have access to but don't own.
A: 

Do some research on rlimit. It's quite possible that the shell or user account you're running in has some RH-default or admin-set resource limits in place.

Dave