Hi there,

This is just a general question relating to some high-performance computing I've been wondering about. A certain low-latency messaging vendor's supporting documentation talks about using raw sockets to transfer data directly from the network device to the user application, and in doing so about reducing the messaging latency even further than its other (admittedly carefully thought-out) design decisions already do.

My question is therefore to those that grok the networking stacks on Unix or Unix-like systems. How much difference are they likely to be able to realise using this method? Feel free to answer in terms of memory copies, numbers of whales rescued or areas the size of Wales ;)

Their messaging is UDP-based, as I understand it, so there's no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully thought about!

Best wishes,

Mike

+1  A: 

To reduce latency in high-performance messaging, you should avoid going through a kernel driver at all. The smallest latency is achieved with user-space drivers (MX does this; InfiniBand may as well).

There is a rather good (but slightly outdated) overview of Linux networking internals, "A Map of the Networking Code in Linux Kernel 2.4.20". It includes diagrams of the TCP/UDP datapath.

Using raw sockets will make the path of TCP packets a bit shorter (thanks for the idea): the kernel's TCP code will not add its latency, but the user must handle the whole TCP protocol itself. There is some chance of optimizing it for specific situations; code for clusters doesn't need to handle long-distance or slow links the way the default TCP/UDP stack does.
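
For example, here is a minimal sketch (my own, assuming Linux and CAP_NET_RAW/root privileges, not taken from any vendor's documentation) of receiving whole frames on a PF_PACKET/SOCK_RAW socket. The kernel's TCP/UDP code is bypassed, so the application must parse the Ethernet/IP/UDP headers itself:

    /* Sketch: receive raw Ethernet frames with a PF_PACKET socket (Linux-only,
     * needs CAP_NET_RAW/root). The whole frame is delivered to user space and
     * the application parses link/IP/UDP headers itself. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>  /* ETH_P_ALL */
    #include <arpa/inet.h>       /* htons */

    int main(void)
    {
        int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char frame[2048];
        for (;;) {
            ssize_t n = recv(fd, frame, sizeof(frame), 0);
            if (n < 0) { perror("recv"); break; }
            /* frame[0..n) is the full Ethernet frame: parse Ethernet, IP and
             * UDP/TCP headers here before reaching the payload. */
            printf("got %zd-byte frame\n", n);
        }
        return 0;
    }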

I'm very interested in this topic too.

osgx
The networking internals from 2.4.20 (NAPI) are still there in 2.6, but there are new sendfile (sendpage)/splice interfaces for eliminating copies.
osgx
It seems like a very interesting topic. I'm also interested in it from the perspective of a Java engineer - to what extent can networking performance (throughput/latency/no GC) be improved by handing this off to a native high-performance networking implementation? Having read that Java was conceived in part as a networking language, I was slightly surprised to read a paper recently that decried the JVM's networking-copying inefficiencies, though this was at least partly in regard to JNI. Perhaps one future direction for the JVM could be to do something special with some of the target OS's networking code.
Michael_73
Incidentally, Stevens' book "Unix Network Programming" has a neat way of stopping the OS sending an RST if you're trying to receive TCP packets through a PF_PACKET socket or BPF/pcap/libnet variant.
Michael_73
@Michael_73, interesting... can you give a more precise link to this feature? I also don't know yet how the OS filters incoming packets to distinguish packets destined for a raw socket from other packets.
osgx
@osgx Got lucky on Google Books - they've not removed that particular page. See the single indented paragraph at the bottom of page 794: http://books.google.co.uk/books?id=ptSC4LpwGA0C
Michael_73
+1  A: 

There are some pictures at http://vger.kernel.org/~davem/tcp_output.html (found by googling tcp_transmit_skb(), which is a key part of the TCP datapath). There are more interesting things on his site: http://vger.kernel.org/~davem/

In the user-to-TCP transmit part of the datapath there is 1 copy from user space into the skb via skb_copy_to_page (when sending with tcp_sendmsg()) and 0 copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment goes undelivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. sendpage can take a page from another part of the kernel and keep it as the backup (I think there is something like copy-on-write there).
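
As an illustration of the two transmit paths (a sketch only -- socket/file setup and error handling omitted, the function names are mine): ordinary send() goes through tcp_sendmsg() and copies the user buffer, while sendfile() goes through tcp_sendpage()/do_tcp_sendpages() and avoids that copy:

    #include <sys/types.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>

    /* send() -> tcp_sendmsg(): the kernel copies 'buf' into skb pages. */
    ssize_t send_with_copy(int sock, const void *buf, size_t len)
    {
        return send(sock, buf, len, 0);
    }

    /* sendfile() -> tcp_sendpage()/do_tcp_sendpages(): the kernel takes
     * references to the page-cache pages of 'file'; no copy into the skb. */
    ssize_t send_zero_copy(int sock, int file, off_t *off, size_t len)
    {
        return sendfile(sock, file, off, len);
    }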

Call paths (traced manually in LXR). Sending: tcp_push_one/__tcp_push_pending_frames

tcp_sendmsg() <-  sock_sendmsg <- sock_readv_writev <- sock_writev <- do_readv_writev

tcp_sendpage() <- file_send_actor <- do_sendfile 

Receiving: tcp_recv_skb()

tcp_recvmsg() <-  sock_recvmsg <- sock_readv_writev <- sock_readv <- do_readv_writev

tcp_read_sock() <- ... splice read for newer kernels, something sendfile-like for older ones

On receive there can be 1 copy from kernel to user: skb_copy_datagram_iovec (called from tcp_recvmsg). For tcp_read_sock() there can also be a copy; it calls the sk_read_actor callback function. If the target corresponds to a file or memory, it may (or may not) copy data out of the DMA zone. If the target is another network, it has the skb of the received packet and can reuse its data in place.

For UDP: receive = 1 copy -- skb_copy_datagram_iovec called from udp_recvmsg; transmit = 1 copy -- udp_sendmsg -> ip_append_data -> getfrag (this seems to be ip_generic_getfrag with 1 copy from user space, but it may be something sendpage/splice-like without page copying).
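
A minimal sketch of the UDP case (the address and port are just an example): sendto() ends up in udp_sendmsg() and recvfrom() in udp_recvmsg(), so by the counts above each direction costs one copy between the user buffer and the skb:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(9000);                 /* example port */
        inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

        const char msg[] = "ping";
        /* one user->kernel copy (udp_sendmsg -> ip_append_data -> getfrag) */
        sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&peer, sizeof(peer));

        char buf[1500];
        /* one kernel->user copy (udp_recvmsg -> skb_copy_datagram_iovec) */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n > 0)
            printf("received %zd bytes\n", n);
        return 0;
    }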

Generally speaking, there must be at least 1 copy when sending from / receiving into user space, and 0 copies when using zero-copy (surprise!) with kernel-space source/target buffers for the data. All headers are added without moving the packet, and a DMA-capable (i.e. any modern) network card will take the data from anywhere in DMA-capable address space. For ancient cards PIO is needed, so there will be one more copy, from kernel space to PCI/ISA/whatever I/O registers/memory.
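
As a sketch of the kernel-buffer-to-kernel-buffer case, splice() (one of the interfaces mentioned in my comment above) can relay data from a file descriptor into a socket without the data ever visiting user space; the pipe in the middle is required by the splice() API (error handling trimmed, the function name is mine):

    #define _GNU_SOURCE
    #include <fcntl.h>    /* splice, SPLICE_F_MOVE */
    #include <unistd.h>   /* pipe, close */

    ssize_t relay_zero_copy(int in_fd, int sock, size_t len)
    {
        int pipefd[2];
        if (pipe(pipefd) < 0)
            return -1;

        /* in_fd -> pipe: data stays in kernel buffers */
        ssize_t moved = splice(in_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (moved > 0)
            /* pipe -> socket: still no trip through user space */
            moved = splice(pipefd[0], NULL, sock, NULL, (size_t)moved, SPLICE_F_MOVE);

        close(pipefd[0]);
        close(pipefd[1]);
        return moved;
    }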

UPD: On the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy on receive, from the rx_ring to the skb, and likewise one more on transmit, from the skb to the tx buffer: +1 copy each way. You have to fill in the IP and TCP headers, but does the skb contain them, or just reserve room for them?

osgx
"The Performance Analysis of Linux Networking – Packet Receiving"(thnx to http://hackingnasdaq.blogspot.com/2010/01/myth-of-procsysnetipv4tcplowlatency.html - myth of tcp_low_latency sysctl)
osgx
hackingnasdaq.blogspot.com - this blog is very interesting; there are a lot of posts about low-latency Linux networking.
osgx
Wow - superb answer. Too bad you can't mod them up by more than one point... but then I can see where that might lead! Have modded up your other answer too though. Cheers osgx! Spasiba
Michael_73
@Michael_73, I hope this will be a part of my thesis :)
osgx
@osgx thanks mate, will have a look when I get the chance.
Michael_73
@Michael_73, http://lion.cs.uiuc.edu/courses/cs498hou_spring05/lectures.html - good slides for TCP (lectures 14-16)
osgx
@Michael_73, and the best images in the slides are stolen from the book "The Linux® Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel" %)
osgx
@osgx Thanks very much for the links, mate. Awesome stuff.
Michael_73