Hello,

I'm building a very high performance Linux server (based on epoll, non-blocking sockets, and async disk IO [based on io_submit/io_getevents/eventfd]). Some of my benchmarks show that the way I handle sockets isn't efficient enough for my requirements. In particular, I'm concerned with getting data from the userspace buffer to the network card, and from the network card back to the userspace buffer (let's ignore the sendfile call for now).

From what I understand, calling read/write on a non-blocking Linux socket isn't fully asynchronous - the system call blocks while it copies the buffer from userspace to the kernel (or the other way around), and only then returns. Is there a way to avoid this overhead in Linux? In particular, is there a fully asynchronous write call that I can make on a socket that would return immediately, DMA the userspace buffer to the network card as necessary, and signal/set an event/etc. on completion? I know Windows has an interface for this, but I couldn't find anything about this in Linux.

Thanks!

A: 

AFAIK you are using the most efficient calls available if you can't use sendfile(2). Various aspects of efficient, high-performance networking code are covered by The C10K problem.

Ronny Vindenes
+4  A: 

There's been some talk on linux-kernel recently about providing an API for something along these lines, but the sticking point is that you can't DMA from general userspace buffers to the network card, because:

  • What looks like contiguous data in the userspace linear address space is probably not contiguous in physical memory, which is a problem if the network card doesn't do scatter-gather DMA;
  • On many machines, not all physical memory addresses are "DMA-able". There's no way at the moment for a userspace application to specifically request a DMA-able buffer.

On recent kernels, you could try using vmsplice and splice together to achieve what you want - vmsplice the pages (with SPLICE_F_GIFT) you want to send into a pipe, then splice them (with SPLICE_F_MOVE) from the pipe into the socket.
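A minimal sketch of that vmsplice/splice sequence, under some assumptions: the helper names `alloc_pages_aligned` and `send_via_splice` are made up for illustration, the socket is assumed to be blocking (a non-blocking socket would need EAGAIN handling tied into your epoll loop), and the buffer is page-aligned with a length that is a whole number of pages, which is what SPLICE_F_GIFT expects. Whether the kernel actually avoids the copy depends on the kernel version and the socket type, so treat this as something to benchmark rather than a guaranteed zero-copy path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* splice, vmsplice, SPLICE_F_* */
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/* Hypothetical helper: allocate `pages` whole, page-aligned, zeroed pages.
 * Stores the byte length in *len_out; returns NULL on failure. */
static void *alloc_pages_aligned(size_t pages, size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = pages * (size_t)page;
    void *buf = NULL;
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    memset(buf, 0, len);
    *len_out = len;
    return buf;
}

/* Hypothetical helper: push `len` bytes at `buf` through a pipe into sock_fd.
 * Returns the number of bytes sent, or -1 on error. Because the pages are
 * gifted, the caller should not modify `buf` afterwards. */
static ssize_t send_via_splice(int sock_fd, void *buf, size_t len)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    size_t sent = 0;
    int failed = 0;
    while (sent < len && !failed) {
        struct iovec iov = {
            .iov_base = (char *)buf + sent,
            .iov_len  = len - sent,
        };
        /* Step 1: map the user pages into the pipe. This may queue fewer
         * bytes than requested if the pipe fills up. */
        ssize_t queued = vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT);
        if (queued < 0) {
            if (errno == EINTR)
                continue;
            failed = 1;
            break;
        }
        /* Step 2: move everything just queued from the pipe to the socket. */
        while (queued > 0) {
            ssize_t moved = splice(pipefd[0], NULL, sock_fd, NULL,
                                   (size_t)queued, SPLICE_F_MOVE);
            if (moved < 0 && errno == EINTR)
                continue;
            if (moved <= 0) {
                failed = 1;
                break;
            }
            queued -= moved;
            sent += (size_t)moved;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return failed ? -1 : (ssize_t)sent;
}
```

A real server would probably create the pipe once per connection (or per worker) rather than on every send, since the pipe setup cost would otherwise eat into whatever the splice path saves.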

caf
Thanks! Do you have a hunch on how efficient this would be compared to read/write? In general, is there a "best practices" guide somewhere for this kind of stuff? It took days to sift through all the polling and signalling APIs, and then more time to benchmark it all before I found a best practice for multiplexing sockets and async IO. It would really help to find a sockets best practices guide. There's the C10K problem page, but most of the information there is many years old (which is ages in kernel time), and usually very inconclusive.
Slava Akhmechet
`splice` and friends are fairly new, so I'm not sure if there's any kind of "best practices" guide for them yet. They should be quite low latency and zero-copy where possible - that's the whole point of them. You could try asking on linux-net and/or linux-kernel mailing lists.
caf