We have an application that uses two types of sockets: a listening UDP socket and an active SCTP socket.

At certain times, scripts with high I/O activity (such as dd, tar, ...) run on the same machine. Most of the time when these I/O-heavy applications run, we see the following problems:

  • The UDP socket closes
  • The SCTP socket is still alive and we can see it in /proc/net/sctp/assocs, but no traffic is received on it anymore (until we restart the application)

Why do these I/O operations affect the network application in this way?
Are there any kernel configurations that would avoid these problems?
I would have expected some packets to be lost on the UDP socket and some retransmissions on the SCTP socket, but not this behavior.

The application runs on a 64-bit server with 4 quad-core CPUs, under RHEL:

# uname -a
Linux server1 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
+1  A: 

When you say the UDP socket closes, what exactly do you mean? You try to send and it fails?

For SCTP, can you collect Wireshark or pcap traces while these I/O operations run (preferably on the peer)? My guess (an educated one, without looking at the code) is that when these I/O operations come into the picture, your process gets starved of CPU time. The other end sends SCTP HEARTBEAT messages and gets no replies. Or, if data was flowing, the peer is not receiving any SACKs because they have not yet been processed by the SCTP stack at your end.
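If running Wireshark on the peer is awkward, a raw tcpdump capture you analyze later works just as well; the interface name and port below are placeholders, and SCTP is IP protocol 132:

# tcpdump -i eth0 -s 0 -w sctp-trace.pcap 'ip proto 132'
# tcpdump -i eth0 -s 0 -w udp-trace.pcap 'udp port <your UDP port>'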

The peer therefore aborts the association internally and stops sending you data (since it sees all the paths as down, it does not send an ABORT; in such a case, your SCTP stack will still think the association is alive). Try to confirm the values of the heartbeat timeout, RTO timeout, SACK timeout, maximum path retransmissions, and maximum association retransmissions at the peer end. I haven't worked with kernel SCTP, but sysctl should be able to give you those values.
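On Linux (lksctp), those tunables are exposed under net.sctp; for example (key names may vary slightly between kernel versions, so treat this as a sketch):

# sysctl -a | grep net.sctp
# sysctl net.sctp.hb_interval net.sctp.sack_timeout net.sctp.path_max_retrans net.sctp.association_max_retrans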

Either way, collecting pcap traces when you observe this problem would give us much better insight into what is going wrong. I hope this helps.

Aditya Sehgal
I wouldn't have thought the other peer actually stopped sending data! We'll try running Wireshark and get back with results (running it on the other peer won't be easy, though)... BTW, we had tried isolating the apps using taskset (to make sure they get CPU time), but that did not help.
Pat
The only reason I suggest taking a Wireshark trace at the peer end is that if your app is not getting CPU time when those I/O operations are triggered, there is little chance Wireshark would get any either. You might see Wireshark in a hung state, which would defeat the whole purpose. The only alternative is to run Wireshark on an intermediate node, if you have one.
Aditya Sehgal
Wireshark can see traffic, and SACKs are being sent back!
Pat
Hmm... that means the kernel is putting messages in the socket queue for your app to read, but your app isn't reading them! Your socket isn't closing, then (just to be sure, run netstat -nap | grep <port number of your socket>). Are you using blocking or non-blocking sockets?
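If the app really has stopped reading, the Recv-Q column in that netstat output should keep growing; you can also re-check the association state the question mentioned:

# netstat -nap | grep <port number of your socket>
# cat /proc/net/sctp/assocs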
Aditya Sehgal
A: 

Here are some things I'd look into:

  • What is the load on the UDP socket when the scripts are not running? Is it continuous or bursty?
  • Does the socket ever spontaneously close when the scripts are not running?
  • What is happening to the data being read off the socket? How much of it (raw or processed) is being written to disk?
  • Can you monitor CPU, network, and disk I/O utilization to see if any of them are saturating?
  • Can the scripts doing the I/O be run at a lower priority or, conversely, can the process owning the UDP socket be run at a higher priority? (See the commands sketched below.)
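On the last point, a rough sketch of the relevant commands; the script name and PID are placeholders, and ionice needs the CFQ I/O scheduler, which may or may not be active on this kernel:

# nice -n 19 ./heavy_io_script.sh     (lower the scripts' CPU priority)
# ionice -c 3 -p <pid>                (put an existing dd/tar into the idle I/O class)
# iostat -x 5                         (watch per-disk utilization)
# vmstat 5                            (watch run queue, iowait, context switches)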

sizzzzlerz
Changing the default priority of our script (and the other jobs running) was the first thing we tried. Unfortunately it did not help much. Also, in case it matters, we could not reproduce the problem by running other apps that consume lots of CPU without doing much I/O.
Pat
A: 

One thing a lot of people don't check is the return value on sends, and they don't check for error conditions like EINTR on recvs. Maybe the heavy I/O load is causing some of your sends or recvs to get interrupted, and your app is treating the errors as hard errors and closing the socket without realizing that they are transient.

I've seen this kind of thing happen and you should definitely check for it by cranking up your log level and seeing if your app is calling close unexpectedly.
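For illustration, a minimal sketch in C of the defensive pattern being described; robust_recv is a made-up helper name, not taken from the asker's code. The same check applies on the send side: a short send() or an EINTR there is not a hard failure either.

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Retry recv() on transient errors instead of treating them as fatal. */
ssize_t robust_recv(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0)
            return n;        /* data, or 0 on orderly shutdown */
        if (errno == EINTR)
            continue;        /* interrupted by a signal: retry, not an error */
        /* On a non-blocking socket, EAGAIN/EWOULDBLOCK just means "no data
         * yet" and is no reason to close the socket. */
        return -1;           /* genuine error: let the caller inspect errno */
    }
}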

Robert S. Barnes
We'll try that next.
Pat