While trying to debug a network problem, I've had trouble interpreting the output of several diagnostic tools.

Context: a distributed application running on 7 nodes of a Linux cluster. 1 Gbit Ethernet, sustained data rate in operation: 600 Mbit/s per node (upstream plus downstream). The symptom is a blocked TCP connection: the sender blocks on a write, the receiver blocks on a receive, and no data is exchanged for over 20 seconds, until the application crashes.

netstat -s shows some failures:

1963106 segments retransmited
49751 packets pruned from receive queue because of socket buffer overrun
2052 packets dropped from prequeue
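To check whether these counters actually grow while the connection is stalled, they could be sampled over time instead of read once. Below is a minimal sketch, assuming the counters are exposed in /proc/net/netstat as header/value line pairs, and assuming the two lines above correspond to the TcpExt fields PruneCalled and TCPPrequeueDropped (that mapping is my reading of the netstat source and should be verified on the kernel in use). The function names are hypothetical.

```python
import time


def parse_proc_net_counters(text):
    """Parse header/value line pairs as used by /proc/net/netstat.

    Each group is two lines with the same prefix, e.g.
        TcpExt: NameA NameB
        TcpExt: 12 34
    Returns a dict such as {"TcpExt.NameA": 12, "TcpExt.NameB": 34}.
    """
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[f"{prefix.strip()}.{name}"] = int(num)
    return counters


def counter_deltas(before, after, keys):
    """Return how much each selected counter grew between two samples."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def watch(path="/proc/net/netstat", interval=5.0,
          keys=("TcpExt.PruneCalled", "TcpExt.TCPPrequeueDropped")):
    """Take two samples `interval` seconds apart and return the deltas.

    The default key names are an assumption; adjust them to whatever
    the local kernel actually reports.
    """
    with open(path) as f:
        before = parse_proc_net_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_net_counters(f.read())
    return counter_deltas(before, after, keys)
```

Calling watch() on the receiving node while a stall is in progress would show whether the pruning and prequeue drops coincide with the blocked connection or are spread over the whole uptime.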

What do the second and third lines mean? Is the TCP stack dropping packets internally due to buffer overflows? Could this be leading to retransmissions?

Ethtool doesn't show any dropped packets on the interface:

 rx_packets: 2736174581
 tx_packets: 2534176576
 rx_bytes: 3086961874562
 tx_bytes: 1945882438598
 rx_errors: 207
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 collisions: 0
 rx_length_errors: 35
 rx_over_errors: 0
 rx_crc_errors: 86

But I'm not sure how to interpret tx/rx_dropped. Does this mean packets dropped on the interface or on the network? I don't see how the interface could know about packets dropped on the network, so I assume it refers to the interface. When could this happen? When it runs out of internal buffers?

Another suspicious indication is the high number of "requeues" in qdisc:

# tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 851772309087 bytes 1187617232 pkt (dropped 0, overlimits 0 requeues 2192465)
rate 0bit 0pps backlog 0b 0p requeues 2192465

When does a requeue happen and what could be causing this?
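For scale, here is a quick calculation of how these counts compare with the totals quoted above. It is only a rough sketch: the retransmission figure mixes a netstat counter (TCP segments) with an ethtool counter (all transmitted packets), and the counters may not have been reset at the same time, so the second ratio is an approximation at best. The helper name is hypothetical.

```python
def ratio_percent(events, total):
    """Return events as a percentage of total (0.0 if total is zero)."""
    return 100.0 * events / total if total else 0.0


# Values copied from the tc, netstat and ethtool output above.
requeues = 2192465
qdisc_sent_pkts = 1187617232
retransmitted_segments = 1963106
tx_packets = 2534176576

requeue_pct = ratio_percent(requeues, qdisc_sent_pkts)
retrans_pct = ratio_percent(retransmitted_segments, tx_packets)

print(f"requeues per packet sent by the qdisc: {requeue_pct:.3f}%")
print(f"retransmitted segments per tx packet:  {retrans_pct:.3f}%")
```

Both ratios come out well below 1%, so the absolute numbers look large mainly because of the traffic volume; what I can't tell is whether the events are evenly spread or clustered around the stalls.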

Thanks in advance, Nuno