I have an application that consists of two processes (let's call them A and B), connected to each other through Unix domain sockets. Most of the time it works fine, but some users report the following behavior:
- A sends a request to B. This works. A now starts reading the reply from B.
- B sends a reply to A. The corresponding write() call returns an EPIPE error, and as a result B calls close() on the socket. However, A did not close() the socket, nor did it crash.
- A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.
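To make the sequence concrete, here is a minimal sketch of what I believe both sides are doing at that point. The names are illustrative, not the actual code; note that both processes must be ignoring SIGPIPE, otherwise B would have been killed by the signal instead of seeing EPIPE:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* B's side: send the reply over the connected socket. */
static void b_send_reply(int fd, const void *reply, size_t len)
{
    signal(SIGPIPE, SIG_IGN);  /* so write() returns EPIPE instead of killing us */
    if (write(fd, reply, len) < 0 && errno == EPIPE) {
        fprintf(stderr, "B: write() failed: %s\n", strerror(errno));
        close(fd);  /* B closes only *after* seeing the error */
    }
}

/* A's side: read the reply, but observe an unexpected end-of-file. */
static void a_read_reply(int fd)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == 0)
        fprintf(stderr, "A: B closed the connection prematurely\n");
}
```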
Users have also reported variations of this behavior, e.g.:
- A sends a request to B. This partially succeeds, but before the entire request is sent, A's write() call returns EPIPE, and as a result A calls close() on the socket (the write loop is sketched after this list). However, B did not close() the socket, nor did it crash.
- B reads a partial request and then suddenly gets an EOF.
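For what it's worth, the request is sent with an ordinary write-everything loop, roughly like the sketch below (again illustrative, not the actual code). In the variation above, write() succeeds for part of the buffer and then fails with EPIPE partway through, which matches B seeing a partial request followed by EOF:

```c
#include <errno.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal; retry */
            return -1;     /* in the failure case, EPIPE lands here mid-request */
        }
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```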
The problem is that I cannot reproduce this behavior locally at all; I've tried on both OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.
Things that I've already tried and considered:
- Double close() bugs (close() being called twice on the same file descriptor): probably not, since that would normally result in EBADF errors, and I haven't seen any. One caveat: if the descriptor number had already been reused in between, the second close() would not return EBADF. A debugging wrapper to rule this out is sketched after this list.
- Increasing the maximum file descriptor limit. One user reported that this worked for him, but the rest reported that it did not. (Raising the limit is sketched below as well.)
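To rule out double close() bugs more rigorously, I am considering a debugging wrapper along the following lines (hypothetical; not in the current code). It tracks which descriptors the process believes are open, so it also catches the nasty case where the descriptor number was reused and the second close() silently closes someone else's descriptor:

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_FDS 65536
static bool fd_open[MAX_FDS];  /* wrappers around open()/socket()/accept()
                                  would set fd_open[fd] = true */

static int checked_close(int fd)
{
    assert(fd >= 0 && fd < MAX_FDS);
    assert(fd_open[fd] && "double close() detected");
    fd_open[fd] = false;
    return close(fd);
}
```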
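And for reference, raising the descriptor limit programmatically looks roughly like this; getrlimit()/setrlimit() with RLIMIT_NOFILE is the standard mechanism on both Linux and OS X (whether the user raised it this way or via ulimit, I don't know):

```c
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>

static void raise_fd_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return;
    }
    rl.rlim_cur = rl.rlim_max;  /* raise soft limit to the hard limit */
#ifdef __APPLE__
    if (rl.rlim_cur > OPEN_MAX)
        rl.rlim_cur = OPEN_MAX;  /* OS X rejects soft limits above OPEN_MAX */
#endif
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
}
```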
What else could possibly cause behavior like this? I know for certain that neither A nor B close() the socket prematurely, and I know for certain that neither of them has crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.