I've been researching a number of networking libraries and frameworks lately such as libevent, libev, Facebook Tornado, and Concurrence (Python).

One thing I notice in their implementations is the use of application-level per-client read/write buffers (e.g. IOStream in Tornado) -- even HAProxy has such buffers.

In addition to these application-level buffers, there's the OS kernel TCP implementation's buffers per socket.
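For reference, a minimal sketch of how those per-socket kernel buffers can be inspected or resized through SO_SNDBUF/SO_RCVBUF; fd is a hypothetical connected socket, and note that Linux roughly doubles the value you request and clamps it to the system-wide maximums:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Query and resize the kernel's per-socket send/receive buffers. */
    static void tune_socket_buffers(int fd, int snd_bytes, int rcv_bytes)
    {
        int cur;
        socklen_t len = sizeof(cur);

        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &cur, &len);
        printf("current kernel send buffer: %d bytes\n", cur);

        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd_bytes, sizeof(snd_bytes));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv_bytes, sizeof(rcv_bytes));
    }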

I think I can understand the app/lib's use of a read buffer: the app/lib reads from the kernel buffer into the app buffer, and the app does something with the data (deserializes a message from it, for instance).
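That read-side pattern might look roughly like the sketch below; it is not any particular library's code, and the struct conn, the 4-byte length prefix, and handle_message() are assumptions made purely for illustration:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>

    void handle_message(const uint8_t *msg, size_t len);  /* app-defined */

    struct conn {
        int     fd;
        uint8_t rbuf[65536];   /* application-level read buffer */
        size_t  rlen;          /* bytes currently buffered */
    };

    /* Drain the kernel receive buffer into the app buffer, then peel off
     * complete length-prefixed messages; a partial message simply stays
     * buffered until the next readable event. */
    static void on_readable(struct conn *c)
    {
        ssize_t n = read(c->fd, c->rbuf + c->rlen, sizeof(c->rbuf) - c->rlen);
        if (n <= 0)
            return;                              /* EOF/error handling omitted */
        c->rlen += (size_t)n;

        while (c->rlen >= 4) {
            uint32_t msg_len;
            memcpy(&msg_len, c->rbuf, 4);
            msg_len = ntohl(msg_len);
            if (c->rlen < (size_t)msg_len + 4)
                break;                           /* message not complete yet */
            handle_message(c->rbuf + 4, msg_len);
            memmove(c->rbuf, c->rbuf + 4 + msg_len, c->rlen - 4 - msg_len);
            c->rlen -= 4 + msg_len;
        }
    }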

However, I have confused myself about the need for and use of a write buffer. Why not just write to the kernel's send/write buffer? Is it to avoid the overhead of system calls (write)? I suppose the point is to be ready with more data to push into the kernel's write buffer when the kernel notifies the app/lib that the socket is "writable" (e.g. EPOLLOUT). But why not just do away with the app write buffer and configure the kernel's TCP write buffer to be equally large?
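A rough sketch of the pattern hinted at here: on a non-blocking socket, write() may accept only part of the data, or none at all with EAGAIN, once the kernel send buffer is full, so the leftover sits in the app write buffer until EPOLLOUT fires. The struct and the want_writable() helper are assumptions, not any particular library's API.

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    void want_writable(int fd);   /* hypothetical: arm EPOLLOUT in the event loop */

    struct conn {
        int    fd;
        char   wbuf[65536];   /* application-level write buffer */
        size_t wlen;          /* bytes still waiting to be written */
    };

    /* Called when the app queues data, and again when epoll reports EPOLLOUT. */
    static int flush_wbuf(struct conn *c)
    {
        while (c->wlen > 0) {
            ssize_t n = write(c->fd, c->wbuf, c->wlen);
            if (n > 0) {
                memmove(c->wbuf, c->wbuf + n, c->wlen - (size_t)n);
                c->wlen -= (size_t)n;
            } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                /* kernel send buffer is full: keep the rest here
                 * and wait for EPOLLOUT before trying again */
                want_writable(c->fd);
                return 0;
            } else {
                return -1;    /* real error */
            }
        }
        return 0;
    }

Enlarging SO_SNDBUF only pushes the problem further out: a single write() can still be short, and per-socket kernel memory is capped (net.core.wmem_max, net.ipv4.tcp_wmem), so the library still needs somewhere to park data it has accepted from the application but not yet handed to the kernel.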

Also, consider a service for which disabling the Nagle algorithm makes sense (e.g. a game server). In such a configuration, I suppose I'd want the opposite: no kernel write buffer, just an application write buffer, yes? When the app is ready to send a complete message, it writes the app buffer via send() etc. and the kernel passes it straight through.
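For the Nagle-off case, the usual approach is not to shrink the kernel buffer but to set TCP_NODELAY and make sure each logical message reaches the kernel in a single write, so it can leave as one segment. A sketch, with the 4-byte type/length framing invented purely for illustration:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Disable Nagle so small writes go out immediately instead of
     * waiting for outstanding ACKs. */
    static void enable_nodelay(int fd)
    {
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }

    /* Assemble the whole message in an app buffer, then hand it to the
     * kernel in one write() so it can leave as a single segment rather
     * than as a separate header and payload. */
    static ssize_t send_update(int fd, uint16_t type,
                               const void *payload, uint16_t len)
    {
        char buf[4 + 65535];
        uint16_t t = htons(type), l = htons(len);

        memcpy(buf, &t, 2);
        memcpy(buf + 2, &l, 2);
        memcpy(buf + 4, payload, len);
        return write(fd, buf, 4 + (size_t)len);   /* short-write handling omitted */
    }

Note that the kernel send buffer still exists in this setup; it is what holds unacknowledged data in flight, so it cannot really be done away with. writev() or TCP_CORK are common alternatives to the extra copy into buf.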

Please help me clear up my understanding of this. Thanks!

+1  A: 

Well, speaking for haproxy, it makes no distinction between read and write buffers: a single buffer is used for both purposes, which saves a copy. However, this makes some changes really painful. For instance, sometimes you have to rewrite an HTTP header, and you have to move data around carefully for the rewrite and save some state about the header's previous value. In haproxy, the Connection header can be rewritten, and its previous and new states are saved because they are needed later, after the rewrite. With separate read and write buffers, you don't have this complexity, as you can always look back into your read buffer if you need any of the original data.
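To make that contrast concrete, here is a rough sketch (not haproxy code) of why separate read and write buffers simplify a header rewrite: you emit the modified line into the write buffer while the original stays untouched in the read buffer, instead of shifting bytes around in place and remembering what used to be there.

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>

    /* Copy one header line from the read buffer into the write buffer,
     * substituting the Connection header's value on the way; the original
     * line remains available in the read buffer for later decisions. */
    static size_t emit_header(const char *rbuf_line, char *wbuf, size_t wcap,
                              const char *new_connection_value)
    {
        if (strncasecmp(rbuf_line, "Connection:", 11) == 0)
            return (size_t)snprintf(wbuf, wcap, "Connection: %s\r\n",
                                    new_connection_value);
        return (size_t)snprintf(wbuf, wcap, "%s\r\n", rbuf_line);
    }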

Haproxy is also able to make use of splicing between sockets on Linux. This means it neither reads nor writes the data itself; it just tells the kernel what to take from where and where to move it. The kernel then just moves pointers, without copying, to transfer TCP segments from one network card to another (when possible); the data never pass through user space, which avoids a double copy.
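The zero-copy path described here is built on the Linux splice() system call, which moves data between a socket and a pipe without lifting it into user space, so forwarding between two sockets goes through an intermediate pipe. A rough sketch of one forwarding step, with error handling trimmed; the pipe would typically be created once per connection (e.g. with pipe2(pipefd, O_NONBLOCK)) and reused:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move up to 64 KB from src_fd to dst_fd without copying the payload
     * through user space: socket -> pipe -> socket, only page references move. */
    static ssize_t forward_spliced(int src_fd, int dst_fd, int pipefd[2])
    {
        ssize_t in = splice(src_fd, NULL, pipefd[1], NULL, 65536,
                            SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (in <= 0)
            return in;                 /* 0 = EOF, <0 = would block or error */

        ssize_t out = 0;
        while (out < in) {
            ssize_t n = splice(pipefd[0], NULL, dst_fd, NULL,
                               (size_t)(in - out),
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                return -1;
            out += n;
        }
        return out;
    }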

You're completely right that, in general, you don't need to copy data between buffers; it's a waste of memory bandwidth. With splicing, haproxy runs at 10 Gbps with 20% CPU; without splicing (two more copies), it's close to 100%. But then consider the complexity of the alternatives and make your choice.

Hoping this helps.

Willy Tarreau
Thanks for the information about HAProxy. The splice() call is interesting: http://kerneltrap.org/node/6505
z8000