I have an application that consists of two processes (let's call them A and B), connected to each other through Unix domain sockets. Most of the time it works fine, but some users report the following behavior:

  1. A sends a request to B. This works. A now starts reading the reply from B.
  2. B sends a reply to A. The corresponding write() call fails with EPIPE, and as a result B closes the socket. However, A did not close() the socket, nor did it crash.
  3. A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.

Users have also reported variations of this behavior, e.g.:

  1. A sends a request to B. This works partially, but before the entire request is sent A's write() call returns EPIPE, and as a result A closes the socket. However, B did not close() the socket, nor did it crash.
  2. B reads a partial request and then suddenly gets an EOF.

The problem is I cannot reproduce this behavior locally at all. I've tried OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.

Things that I've already tried and considered:

  • Double close() bugs (close() is called twice on the same file descriptor): probably not, as that would result in EBADF errors, and I haven't seen any.
  • Increasing the maximum file descriptor limit. One user reported that this worked for him; the rest reported that it did not.

What else can possibly cause behavior like this? I know for certain that neither A nor B close() the socket prematurely, and I know for certain that neither of them has crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.

A: 
  • shutdown() may have been called on one of the socket endpoints.

  • If either side may fork and execute a child process, ensure that the FD_CLOEXEC (close-on-exec) flag is set on the socket file descriptor if you did not intend for it to be inherited by the child. Otherwise the child process could (accidentally or otherwise) be manipulating your socket connection.

mark4o
Thanks, but neither situation applies to my program.
Hongli
A: 

I would also check that there's no sneaky firewall in the middle. It's possible that an intermediate forwarding node on the route sends an RST. The best way to track that down is, of course, a packet sniffer (or its GUI cousin).

Nikolai N Fetissov
... on a UNIX domain socket? That's a local-only protocol.
ephemient
Oh ... shoot, I totally missed that. Thanks.
Nikolai N Fetissov
A: 

Perhaps you could try strace as described in: http://modperlbook.org/html/6-9-1-Detecting-Aborted-Connections.html

I assume that your problem is related to the one described here: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Unfortunately I'm having a similar problem myself, but couldn't manage to fix it with the given advice. However, perhaps that SO_LINGER thing works for you.

It turned out that the server's file descriptor had been added to the epoll queue with the EPOLLET (edge-triggered) flag, which seems to have been the mistake.