I have a web-based application and a client, both written in Java. For what it's worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpClient. The server blocks for up to a minute and if no messages have arrived for the client within that minute, the server returns HTTP 204 No Content. Otherwise, as soon as a message is ready for the client, it is returned with the body of an HTTP 200 OK.
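
To make the flow concrete, the server side boils down to something like the sketch below. This is illustrative only, not my actual code; the servlet name and the per-client queue lookup are made up.

    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    // Simplified long-poll servlet: block for up to a minute waiting for a message,
    // then answer 200 with the message body, or 204 if nothing arrived in time.
    public class PollServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            BlockingQueue<String> queue = queueFor(req.getSession()); // made-up lookup

            String message = null;
            try {
                message = queue.poll(60, TimeUnit.SECONDS); // block up to one minute
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }

            if (message == null) {
                resp.setStatus(HttpServletResponse.SC_NO_CONTENT); // 204, no body
            } else {
                resp.setContentType("text/plain");                 // 200 with body
                resp.getWriter().write(message);
                resp.flushBuffer();
            }
        }

        private BlockingQueue<String> queueFor(HttpSession session) {
            // Placeholder for whatever per-client queue registry the real app uses.
            throw new UnsupportedOperationException("illustration only");
        }
    }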

Here is what has me puzzled: Intermittently for a specific subset of clients -- always clients with demonstrably flaky network connections -- the client issues a GET, the server receives and processes the GET, but the client sits forever. Enabling debugging logs for the client, I see that HttpClient is still waiting for the very first line of the response.

There is no Exception thrown on the server, at least nothing logged anywhere, not by Tomcat, not by my webapp. According to debugging logs, there is every sign that the server successfully responded to the client. However, the client shows no sign of having received anything. The client hangs indefinitely in HttpClient.executeMethod. This becomes obvious after the session times out and the client takes action that causes another Thread to issue an HTTP POST. Of course, the POST fails because the session has expired. In some cases, hours have elapsed between the session expiring and the client issuing a POST and discovering this fact. For this entire time, executeMethod is still waiting for the HTTP response line.

When I use Wireshark to see what is really going on at the wire level, this failure does not occur. That is, the failure will occur within a few hours for specific clients, but when Wireshark is running at both ends, those same clients will run overnight (14 hours) without a failure.

Has anyone else encountered something like this? What in the world can cause it? I thought that TCP/IP guaranteed packet delivery even across short-term network glitches. If I set an SO_TIMEOUT and immediately retry the request upon timeout, the retry always succeeds. (Of course, I first abort the timed-out request and release the connection to ensure that a new socket will be used.)
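
The retry workaround looks roughly like the following sketch against the HttpClient 3.x API. The URL and the 75-second timeout are just illustrative values, not my real configuration.

    import java.io.IOException;
    import java.net.SocketTimeoutException;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpStatus;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class PollClient {
        // Hypothetical URL; the real one points at the long-poll servlet.
        private static final String POLL_URL = "http://example.com/app/poll";

        public static void main(String[] args) throws IOException {
            HttpClient client = new HttpClient();
            // The server holds the request for up to 60s, so give up a little after that.
            client.getParams().setSoTimeout(75 * 1000);

            while (true) {
                GetMethod get = new GetMethod(POLL_URL);
                try {
                    int status = client.executeMethod(get);
                    if (status == HttpStatus.SC_OK) {
                        handle(get.getResponseBodyAsString());
                    }
                    // 204 No Content: nothing arrived this minute; just poll again.
                } catch (SocketTimeoutException e) {
                    // No response line within SO_TIMEOUT: abandon this socket so the
                    // next GetMethod gets a fresh connection.
                    get.abort();
                } finally {
                    get.releaseConnection();
                }
            }
        }

        private static void handle(String message) {
            System.out.println("Received: " + message);
        }
    }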

Thoughts? Ideas? Is there some TCP/IP setting available to Java or a registry setting in Windows that will enable more aggressive TCP/IP retries on lost packets?

A: 

Could these computers have a virus or malware installed? Using Wireshark installs WinPcap (http://www.winpcap.org/), which may be overriding changes the malware made (or the malware may simply detect that it is being monitored and not attempt anything fishy).

BarrettJ
I hadn't considered this, but it is remotely possible, of course. Since I only see this on clients with a flaky network connection, I have so far assumed that the flakiness itself is somehow the cause.
Eddie
Malware is remotely possible, but very unlikely. Go with what you already know - flakiness.
Gary
+1  A: 

I haven't seen this one per se, but I have seen similar problems with large UDP datagrams causing IP fragmentation, which led to congestion and ultimately dropped Ethernet frames. Since this is TCP, I wouldn't expect IP fragmentation to be a large issue, since it is a stream-based protocol.

One thing that I will note is that TCP does not guarantee delivery! It can't. What it does guarantee is that if you send byte A followed by byte B, then you will never receive byte B before you have received byte A.

With that said, I would connect the client machine and a monitoring machine to a hub and run Wireshark on the monitoring machine; you should be able to see what is going on. I did run into problems related to both whitespace handling between HTTP requests and incorrect HTTP chunk sizes. Both issues were due to a hand-written HTTP stack, so this is only a problem if you are using a flaky stack.

D.Shawley
+1  A: 

Forgetting to flush or close the socket on the host side can intermittently have this effect for short responses, depending on timing, which could be affected by the presence of any monitoring mechanism.

Forgetting to close, in particular, will leave the socket dangling until the GC gets around to reclaiming it and calling finalize().
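
For example, in a servlet environment like the one described, something along these lines makes the flush and close explicit. This is only a sketch; the helper name and charset handling are my own illustration, not the poster's code.

    import java.io.IOException;
    import java.io.PrintWriter;

    import javax.servlet.http.HttpServletResponse;

    // Illustrative helper: write a short response body and make sure it is pushed
    // to the client before the handler returns, rather than relying on the
    // container or garbage collection to get around to it.
    public final class ResponseUtil {
        private ResponseUtil() { }

        public static void writeAndFlush(HttpServletResponse resp, String body)
                throws IOException {
            resp.setContentType("text/plain; charset=UTF-8");
            resp.setContentLength(body.getBytes("UTF-8").length);
            PrintWriter out = resp.getWriter();
            try {
                out.write(body);
            } finally {
                out.flush(); // explicit flush: don't leave a short body buffered
                out.close(); // explicit close: don't wait for finalize()
            }
        }
    }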

Software Monkey
A: 

If you are losing data, it is most likely due to a software bug, either in the reading or writing library.

Peter Lawrey
+6  A: 

Are you absolutely sure that the server has successfully sent the response to the clients that seem to fail? By this I mean that the server has sent the response and the client has ACKed that response back to the server. You should be able to see this with Wireshark on the server side. If you are sure this has occurred on the server side and the client still does not see anything, you need to look further up the chain from the server. Are there any proxy/reverse-proxy servers or NAT involved?

The TCP transport is considered a reliable protocol, but it does not guarantee delivery. The TCP/IP stack of your OS will try pretty hard to get packets to the other end using TCP retransmissions. You should see these in Wireshark on the server side if this is happening. If you see excessive TCP retransmissions, it is usually a network infrastructure issue, i.e. bad or misconfigured hardware or interfaces. TCP retransmission works well for short network interruptions, but performs poorly across a longer interruption. This is because the TCP/IP stack only sends a retransmission after a timer expires, and that timer typically doubles after each unsuccessful retransmission. This is by design, to avoid flooding an already problematic network with retransmissions. As you might imagine, it usually causes applications all sorts of timeout issues.

Depending on your network topology, you may also need to place probes (Wireshark/tcpdump) at other intermediate locations in the network. It will probably take some time to find out where the packets have gone.

If I were you, I would keep monitoring with Wireshark at all ends until the problem recurs. It most likely will. But it sounds like what you will ultimately find is what you already mentioned: flaky hardware. If fixing the flaky hardware is out of the question, you may need to build in extra application-level timeouts and retries to deal with the issue in software. It sounds like you have already started down this path.
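
For example, an application-level retry wrapper on the client might look something like the sketch below (again assuming the HttpClient 3.x API mentioned in the question; the attempt count, timeout, and backoff values are arbitrary).

    import java.io.IOException;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Illustrative application-level retry: give each attempt its own read deadline
    // and back off between attempts instead of trusting TCP retransmission to recover.
    public class RetryingPoller {
        private final HttpClient client;

        public RetryingPoller(HttpClient client) {
            this.client = client;
        }

        public String pollWithRetries(String url, int maxAttempts) throws IOException {
            long delayMs = 1000;              // arbitrary starting backoff
            IOException lastFailure = null;

            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                GetMethod get = new GetMethod(url);
                get.getParams().setSoTimeout(75 * 1000); // per-request read deadline
                try {
                    int status = client.executeMethod(get);
                    return status == 200 ? get.getResponseBodyAsString() : null;
                } catch (IOException e) {
                    lastFailure = e;
                    get.abort();              // force a fresh socket on the next attempt
                } finally {
                    get.releaseConnection();
                }
                try {
                    Thread.sleep(delayMs);    // back off before retrying
                    delayMs *= 2;
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            throw lastFailure != null
                    ? lastFailure
                    : new IOException("gave up after " + maxAttempts + " attempts");
        }
    }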

Gary
All I can tell from the debugging in place when it has occurred is that my web app believes it has responded. I didn't enable any debugging in Tomcat (6.x) itself to see whether it believed it had completed the response. There were no complaints in Tomcat's log, nor Apache HTTPD's log, nor mod_jk's log. Flaky hardware is entirely out of my hands ... in some cases people are going across the public internet.
Eddie
There is no substitute for hard information. Wireshark will tell you who's talking and who's not.
Hans Malherbe
+2  A: 

If you are using long-running GETs, you should time out on the client side at twice the server timeout, as you have discovered.

On a TCP connection where the client sends a message and expects a response, if the server were to crash and restart (for the sake of example), the client would still be waiting on the socket for a response from the server, yet the server is no longer listening on that socket.

The client will only discover that the socket is closed on the server end once it sends more data on that socket and the server rejects the new data and closes the connection.

This is why you should have client-side timeouts on requests.

But since your server is not crashing: if the server is multi-threaded and the server-side socket for that client is closed, but at that moment (for a duration of minutes) the client has a connectivity outage, then the end-of-connection handshaking may be lost; and since the client is not sending more data to the server, the client is once again left hanging. This would tie in with your flaky-connection observation.

Simeon Pilgrim