I have a web-based application and a client, both written in Java. For what it's worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpClient. The server blocks for up to a minute and if no messages have arrived for the client within that minute, the server returns HTTP 204 No Content. Otherwise, as soon as a message is ready for the client, it is returned with the body of an HTTP 200 OK.
Here is what has me puzzled: Intermittently for a specific subset of clients -- always clients with demonstrably flaky network connections -- the client issues a GET, the server receives and processes the GET, but the client sits forever. Enabling debugging logs for the client, I see that HttpClient is still waiting for the very first line of the response.
There is no Exception thrown on the server, at least nothing logged anywhere, not by Tomcat, not by my webapp. According to debugging logs, there is every sign that the server successfully responded to the client. However, the client shows no sign of having received anything. The client hangs indefinitely in HttpClient.executeMethod. This becomes obvious after the session times out and the client takes action that causes another Thread to issue an HTTP POST. Of course, the POST fails because the session has expired. In some cases, hours have elapsed between the session expiring and the client issuing a POST and discovering this fact. For this entire time, executeMethod
is still waiting for the HTTP response line.
When I use WireShark to see what is really going on at the wire level, this failure does not occur. That is, this failure will occur within a few hours for specific clients, but when WireShark is running at both ends, these same clients will run overnight, 14 hours, without a failure.
Has anyone else encountered something like this? What in the world can cause it? I thought that TCP/IP guaranteed packet delivery even across short term network glitches. If I set an SO_TIMEOUT and immediately retry the request upon timeout, the retry always succeeds. (Of course, I first abort the timed-out request and release the connection to ensure that a new socket will be used.)
Thoughts? Ideas? Is there some TCP/IP setting available to Java or a registry setting in Windows that will enable more aggressive TCP/IP retries on lost packets?