We have a Webstart client that communicates with the server by sending serialized objects over HTTPS using java.net.HttpsURLConnection.

Everything works perfectly fine on my local machine and on test servers located in our office, but I'm experiencing a very, very strange issue which is only occurring on our production and staging servers (and sporadically at that). The main difference I know of between those servers and the ones in our office is that they are located elsewhere and client-server communication with them is considerably slower, but it worked fine for a long time in production prior to this as well.

Anyway, here's what's happening:

  • The client, after setting options such as read timeout and properties such as Content-Type on the HttpURLConnection, calls getOutputStream() on it to get the stream to write to.
  • At this point, from what I can tell, the client hangs for some period of time.
  • The client then throws the following exception:
java.net.ConnectException: Connection timed out: connect
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(Unknown Source)
    at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.connect(Unknown Source)
    at com.sun.net.ssl.internal.ssl.BaseSSLSocketImpl.connect(Unknown Source)
    at sun.net.NetworkClient.doConnect(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.<init>(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.New(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(Unknown Source)

Note that this is not a SocketTimeoutException, which the connect() method on HttpURLConnection says it throws if the timeout expires before a connection can be established. Also, when this happens I am able to call conn.getResponseCode() and I get a response code of 200.

  • On the server side, an EOFException is thrown in ObjectInputStream's constructor, which tries to read the serialization header but fails because the client never gets the OutputStream to write to.
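The server-side symptom can be reproduced in isolation. The following is a minimal sketch, not the actual servlet code: the class and method names are made up for illustration, and an empty byte array stands in for a request body that never arrived. It shows that ObjectInputStream's constructor fails with EOFException when the stream ends before the serialization header can be read.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;

// Hypothetical demo class; not part of the application described above.
class EmptyBodyDemo {

    // Returns true if constructing an ObjectInputStream over the given bytes
    // fails with EOFException, i.e. the stream ended before the
    // serialization header could be read.
    static boolean headerReadFails(byte[] body) throws IOException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(body))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // stream was empty or truncated inside the header
        }
    }

    public static void main(String[] args) throws IOException {
        // An empty body stands in for a request whose client never wrote anything.
        System.out.println(headerReadFails(new byte[0])); // prints "true"
    }
}
```

So an EOFException at that point is consistent with a request that reached the servlet with no body at all, rather than with a corrupted one.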

In case it helps, here are the calls being made on the HttpsURLConnection prior to the call to getOutputStream() (edited to show only the calls being made rather than the whole structure of the code doing this):

HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
conn.setUseCaches(false);
conn.setReadTimeout(30000);
conn.setRequestProperty("Cookie", cookie);
conn.setDoOutput(true);
conn.setRequestProperty("Content-Type", "application/x-java-serialized-object");
conn.getOutputStream();

The thing is, I have no idea how any of this could be happening, especially given that it only happens occasionally (no clear pattern of activity that I can tell) and even then only when there's (relatively) high latency between the client and the server.

Given what I've been able to find so far about java.net.ConnectException: Connection timed out: connect, I wondered whether it might be some network or firewall issue on the network our servers are running on... but that doesn't make much sense to me given that the request is clearly getting through to the servlet. Also, other apps running on the same network have not reported similar issues.

Does anyone have any idea what the cause of this could be, or even what I should investigate?

+4  A: 

We have come across these in a case similar to yours, usually at high load and not easy to reproduce on test. We have not fixed it yet, but these are the steps we went through.

If it were a firewall issue, we would expect to get a Connection Refused or a SocketTimeoutException.

1) Are you able to track these requests in the access log on the server - do they show an HTTP status 200 or 404 or something else? In our case, the server (IIS in this case) logs showed the client closed the connection and not the server. So that was a mystery.

Update: If the client always gets a 200, then the server has actually sent back some response but I suspect the response byte-size (if this is recorded in the access logs) will show a different value from that of the normal response size for that request.

If it shows the same size of response, then you have a (perhaps implausible) condition in which the server actually responded correctly but the client did not get the response back because the connection was terminated somewhere in between.

2) The network admin teams looked at the TCP/IP traffic to determine which end (or intermediate router) is terminating the HTTP / TCP-IP conversation. Once we understand which end is terminating the connection, the next step is to look at why. Someone knowledgeable enough could run snoop.

3) Is there a max number of requests configured/restricted on the server - and is that throttling your connections?

4) Are there any intermediate load balancers at which requests could be dropped?

Update: One more thing we wanted to do, but did not complete, was to create a static route between client and server to reduce the number of hops in between and rule out network-related connection drops. See http://en.wikipedia.org/wiki/Static_routing

5) Another suggestion is to set the connect timeout as well (setConnectTimeout) to see whether these requests succeed with a higher value. Update: You might also want to try conn.getErrorStream(), which is documented as follows:

Returns the error stream if the connection failed but the server sent useful data nonetheless. If the connection was not connected, or if the server did not have an error while connecting or if the server had an error but no error data was sent, this method will return null.

6) Could also try taking a set of thread dumps on the server 5 seconds apart, to see if any thread shows these incoming requests on the server.
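Suggestion 5 above could be tried along these lines. This is only a sketch under assumptions: the class and method names are hypothetical, and the timeout values would need tuning for your network. It sets an explicit connect timeout next to the read timeout and, when opening the output stream fails, records the exception type together with whatever getErrorStream() returns (which may be null).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for diagnosing failed connections.
class ErrorStreamProbe {

    // Opens (but does not yet connect) a connection with explicit timeouts.
    static HttpURLConnection configure(String url, int connectMs, int readMs)
            throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectMs); // limit on establishing the connection
        conn.setReadTimeout(readMs);       // limit on waiting for response data
        conn.setUseCaches(false);
        return conn;
    }

    // Reads the whole stream as UTF-8 text; a null stream yields "".
    static String drain(InputStream in) throws IOException {
        if (in == null) return "";
        try (InputStream s = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = s.read(chunk)) != -1) buf.write(chunk, 0, n);
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Tries to open the output stream; on failure, returns the exception
    // type and message plus any error body the server managed to send.
    static String tryOpen(HttpURLConnection conn) {
        try {
            conn.setDoOutput(true);
            conn.getOutputStream().close();
            return "ok";
        } catch (IOException e) {
            String body;
            try {
                body = drain(conn.getErrorStream());
            } catch (IOException inner) {
                body = "";
            }
            return e.getClass().getName() + ": " + e.getMessage()
                + " | error body: " + body;
        }
    }
}
```

Logging the exception class this way makes it easier to see whether a given failure is a ConnectException or a SocketTimeoutException once the connect timeout is set.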

Update: As of today we have learnt to live with this problem, because we totalled the failure rate at 200-300 out of 400,000 requests per day, which is roughly 0.05% to 0.075%.

JoseK
Thanks for your answer. I'm not sure about the server access logs, but I did edit the question to note that the client sees a response code of 200 after catching the exception. I've experimented with the connect timeout value, but from what I could tell, a `SocketTimeoutException` is thrown when that is exceeded (rather than a `ConnectException`). I'm not sure about any of the other things, but they all seem worth investigating.
ColinD
@ColinD: Does conn.getErrorStream() as in my update show anything interesting?
JoseK
@JoseK: I haven't had a chance to try that yet, though given what's happening on the server side, it would not be writing anything to stream back to the client.
ColinD
I ended up having to just set a lowish connect timeout and call `URLConnection.connect()` in such a way that I can retry it a few times if it times out. Not ideal, but we haven't been able to determine what exactly is causing this.
ColinD
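The workaround described in the last comment might look roughly like the sketch below. It is not the code actually used: the ConnectionFactory interface, the class name, and the parameters are assumptions made for illustration, and a fresh connection is opened for each attempt instead of reusing one that has already failed.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.URLConnection;

// Hypothetical retry helper; names and parameters are illustrative only.
class ConnectRetry {

    // Hypothetical factory: opens and configures a fresh, unconnected
    // connection (headers, read timeout, setDoOutput and so on).
    interface ConnectionFactory {
        URLConnection open() throws IOException;
    }

    // Calls connect() up to maxAttempts times with a short connect timeout.
    // Only connect failures are retried; any other IOException propagates
    // immediately. If every attempt fails, the last failure is rethrown.
    static URLConnection connectWithRetry(ConnectionFactory factory,
                                          int connectTimeoutMs,
                                          int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            URLConnection conn = factory.open();
            conn.setConnectTimeout(connectTimeoutMs);
            try {
                conn.connect(); // nothing has been written yet, so a retry is safe
                return conn;
            } catch (SocketTimeoutException | ConnectException e) {
                last = e;       // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A caller would pass a lambda that performs url.openConnection() and the setup calls shown in the question, then call getOutputStream() on the returned connection. Because connect() happens before any request body is written, retrying at this stage should not send the serialized object twice.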