I am trying to squeeze the most performance out of a Linux block driver for a high-end storage device. One problem that has me a bit stumped at the moment is this: if a user task starts an I/O operation (read or write) on one CPU, and the device interrupt occurs on another CPU, I incur about 80 microseconds of delay before the task resumes execution.
I can see this using O_DIRECT against the raw block device, so this is not page cache or filesystem-related. The driver uses make_request to receive operations, so it has no request queue and does not utilize any kernel I/O scheduler (you'll have to trust me, it's way faster this way).
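For reference, the driver's bio path is essentially the following (a stripped-down sketch, not my actual code; the mydev_* names are placeholders and the signatures match the 2.6.32-era API):

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    static struct request_queue *mydev_queue;

    /* Called directly with each bio; no request queue, no elevator. */
    static int mydev_make_request(struct request_queue *q, struct bio *bio)
    {
            /* ... hand the bio to the hardware here ... */
            return 0;       /* bio accepted; completed later from the IRQ handler */
    }

    /* Completion path, run from the device's interrupt handler. */
    static void mydev_complete(struct bio *bio)
    {
            bio_endio(bio, 0);      /* wakes the waiting task, possibly on another CPU */
    }

    static int mydev_setup_queue(void)
    {
            mydev_queue = blk_alloc_queue(GFP_KERNEL);
            if (!mydev_queue)
                    return -ENOMEM;
            blk_queue_make_request(mydev_queue, mydev_make_request);
            return 0;
    }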
I can demonstrate to myself that the problem occurs between calling bio_endio on one CPU and the task being rescheduled on another CPU. If the task is on the same CPU, it resumes very quickly; if it is on another physical CPU, it takes a lot longer -- usually about 80 microseconds longer on my current test system (x86_64 on the Intel 5520 [NUMA] chipset).
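The user-visible side of that comparison can be reproduced with a timed O_DIRECT read loop along these lines (the device path, block size and iteration count are placeholders, not my real test parameters):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t bs = 4096;
            struct timespec t0, t1;
            void *buf;
            int i, fd;

            fd = open("/dev/mydev0", O_RDONLY | O_DIRECT);
            if (fd < 0 || posix_memalign(&buf, 4096, bs))
                    return 1;

            for (i = 0; i < 10000; i++) {
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    if (pread(fd, buf, bs, 0) != (ssize_t)bs)
                            return 1;
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    printf("%ld ns\n",
                           (t1.tv_sec - t0.tv_sec) * 1000000000L +
                           (t1.tv_nsec - t0.tv_nsec));
            }
            close(fd);
            free(buf);
            return 0;
    }

Run pinned to the IRQ's CPU versus pinned to another physical CPU, the per-I/O times differ by roughly the 80 microseconds mentioned above.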
I can instantly double my performance by setting the process and IRQ CPU affinity to the same physical CPU, but that's not a good long-term solution -- I'd rather be able to get good performance no matter where the I/Os originate. And I only have one IRQ, so I can only steer it to one CPU at a time -- no good if many threads are running on many CPUs.
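(For completeness, that workaround is just the usual affinity plumbing: the IRQ gets steered by writing a CPU mask to /proc/irq/<N>/smp_affinity, and the I/O thread pins itself to the same CPU, roughly like this -- the CPU number passed in is only an example:)

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to a single CPU (the one the IRQ is routed to). */
    static int pin_to_cpu(int cpu)
    {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            return sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = current task */
    }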
I can see this problem on kernels from CentOS 5.4's 2.6.18 up to mainline 2.6.32.
So the question is: why does it take longer for the user process to resume if I called bio_endio from another CPU? Is this a scheduler issue? And is there any way to eliminate or reduce the delay?