I am trying to squeeze the most performance out of a Linux block driver for a high-end storage device. One problem that has me a bit stumped at the moment is this: if a user task starts an I/O operation (read or write) on one CPU, and the device interrupt occurs on another CPU, I incur about 80 microseconds of delay before the task resumes execution.

I can see this using O_DIRECT against the raw block device, so this is not page cache or filesystem-related. The driver uses make_request to receive operations, so it has no request queue and does not use any kernel I/O scheduler (you'll have to trust me, it's way faster this way).
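
To give a concrete picture of the driver structure (a minimal sketch with made-up names, written against roughly the 2.6.32 make_request API, not my actual code): bios are intercepted before any request queue or I/O scheduler is involved, and completed directly with bio_endio, usually from the interrupt handler.

    #include <linux/module.h>
    #include <linux/blkdev.h>
    #include <linux/bio.h>

    static struct request_queue *my_queue;    /* hypothetical names throughout */

    /* Called for every bio submitted to the device; there is no request
     * queue and no elevator in this path. */
    static int my_make_request(struct request_queue *q, struct bio *bio)
    {
        /* ...hand the bio to the hardware here... */

        /* Completion normally happens later, in the IRQ handler --
         * possibly on a different CPU than the submitter: */
        bio_endio(bio, 0);          /* 0 on success, negative errno on error */
        return 0;                   /* bio handled, nothing left to requeue */
    }

    static int __init my_driver_init(void)
    {
        my_queue = blk_alloc_queue(GFP_KERNEL);
        if (!my_queue)
            return -ENOMEM;
        blk_queue_make_request(my_queue, my_make_request);
        return 0;
    }
    module_init(my_driver_init);
    MODULE_LICENSE("GPL");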

I can demonstrate to myself that the problem occurs between calling bio_endio on one CPU and the task being rescheduled on another CPU. If the task is on the same CPU as the completion, it resumes very quickly; if the task is on another physical CPU, it takes a lot longer -- usually about 80 microseconds longer on my current test system (x86_64 on an Intel 5520 [NUMA] chipset).
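
For reference, this is roughly how I measure it from user space (a sketch only; the device path and transfer size are made up): an O_DIRECT read from the raw block device, timed around pread().

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        void *buf;
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);  /* hypothetical device */

        if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pread(fd, buf, 4096, 0) != 4096)
            return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("read latency: %ld ns\n",
               (t1.tv_sec - t0.tv_sec) * 1000000000L +
               (t1.tv_nsec - t0.tv_nsec));
        close(fd);
        return 0;
    }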

I can instantly double my performance by setting the process and IRQ CPU affinity to the same physical CPU, but that's not a good long-term solution -- I'd rather be able to get good performance no matter where the I/Os originate. And I only have one IRQ, so I can only steer it to one CPU at a time -- no good if many threads are running on many CPUs.
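
The workaround itself is simple enough -- something like this (a sketch; the IRQ number and CPU are hypothetical): pin the submitting thread with sched_setaffinity and steer the device IRQ to the same CPU through /proc/irq/<n>/smp_affinity.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread and (hypothetical) IRQ 42 to the same CPU. */
    static int pin_thread_and_irq(int cpu)
    {
        cpu_set_t set;
        FILE *f;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)   /* 0 = this thread */
            return -1;

        f = fopen("/proc/irq/42/smp_affinity", "w");
        if (!f)
            return -1;
        fprintf(f, "%x\n", 1u << cpu);                       /* CPU bitmask, hex */
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return pin_thread_and_irq(0) ? 1 : 0;
    }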

I can see this problem on kernels from CentOS 5.4's 2.6.18 to mainline 2.6.32.

So the question is: why does it take longer for the user process to resume if I call bio_endio from another CPU? Is this a scheduler issue? And is there any way to eliminate or lower the delay?

+1  A: 

If you finish your I/O on a particular CPU, then that processor is immediately free to start working on a new thread - if you finish your I/O on the same processor as the thread that requested it, then the next thread to run is likely to be the one you just finished I/O for.

On the other hand, if you finish on a different processor, the thread that requested the I/O won't get to run immediately - it has to wait until whatever is currently executing finishes its quantum or otherwise relinquishes the CPU.

As far as I understand.

Anon.
That's a good idea, and something I'll look into; however, during my testing I am only running I/O-bound threads, so there should not be any other runnable processes most of the time.
Eric Seppanen
+1  A: 

It could just be the latency inherent in issuing an IPI from the CPU that completed the bio to the CPU where the task gets scheduled - to test this, try booting with idle=poll.

caf
Interesting thought. It gave me a tiny performance boost (2-3%) but doesn't seem to affect the main problem (which cuts performance in half).
Eric Seppanen
A: 

Looks like I misunderstood the problem a bit: it seems to be related to cache misses. When the CPU handling the interrupt isn't the CPU that started the I/O, the interrupt-handling CPU can hit 100% utilization, and then everything slows down, giving the impression that there is a long delay communicating between CPUs.

Thanks to everyone for their ideas.

Eric Seppanen
Yeah, I was thinking along these lines, but I was reluctant to chime in with even more speculation. It could well be a lock (perhaps a task lock?) bouncing from one cache to another. Really, though, it sounds like you need to set things up so that you're doing a lot more I/O per interrupt.
caf
No locks, just a really huge amount of I/O. And a NUMA system that makes cross-CPU misses really expensive. Bigger transfers are better, but sometimes you have to take what the kernel or application gives you.
Eric Seppanen
+1  A: 

This patch was just posted to LKML; it implements QUEUE_FLAG_SAME_CPU in the block layer, and is described as:

Add a flag to make request complete on cpu where request is submitted. The flag implies QUEUE_FLAG_SAME_COMP. By default, it is off.

It sounds like it might be just what you need...
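
For completeness: drivers that do go through a regular request queue already have a related knob in mainline kernels of that vintage, QUEUE_FLAG_SAME_COMP (exposed as the rq_affinity sysfs attribute), which steers completions back toward the submitting CPU. A rough sketch of a driver opting in, assuming the queue-flag helpers available around 2.6.32:

    #include <linux/blkdev.h>

    /* Ask the block layer to complete requests on (or near) the CPU that
     * submitted them; roughly equivalent to echoing 1 into
     * /sys/block/<dev>/queue/rq_affinity. */
    static void enable_completion_affinity(struct request_queue *q)
    {
        queue_flag_set_unlocked(QUEUE_FLAG_SAME_COMP, q);
    }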

caf
Yes, that sounds like a good idea. Unfortunately, I'm using make_request to intercept bios before they hit the queue, so I can't take advantage of this, but it's nice to know that kernel folks are thinking in that direction. I'll have to have a look at how this is implemented and see if there are any good ideas worth borrowing.
Eric Seppanen