I have a Java program running on Windows (a Citrix machine) that dispatches requests to Java application servers on Linux; the dispatching mechanism is entirely custom.
The Windows Java program (let's call it W) opens a listen socket on a port assigned by the OS (say 1234) to receive results. Then it invokes a "dispatch" service on the server with a "business request". This service splits the request into jobs, sends them to other servers (let's call them S1 ... Sn), and synchronously returns the number of jobs to the client.
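To make the setup concrete, here is a minimal sketch of the receiving side as described above. It is not the actual dispatcher code (which is custom); the class and method names are hypothetical, and `sendResult` merely stands in for one of the S servers pushing a result back to W.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ResultListenerSketch {

    /** Opens a listen socket on a port chosen by the OS (port 0 means "any free port"). */
    static ServerSocket openListener() {
        try {
            return new ServerSocket(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Accepts one connection and reads it until the sender closes its end. */
    static byte[] receiveOneResult(ServerSocket listener) {
        try (Socket s = listener.accept(); InputStream in = s.getInputStream()) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) { // blocks until data or end of stream
                result.write(buf, 0, n);
            }
            return result.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for a server S: connects back to W and writes one result. */
    static void sendResult(int port, byte[] payload) {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             OutputStream out = s.getOutputStream()) {
            out.write(payload);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = openListener()) {
            int port = listener.getLocalPort();
            System.out.println("W listening on port " + port);
            new Thread(() -> sendResult(port, "job-1 result".getBytes(StandardCharsets.UTF_8))).start();
            byte[] got = receiveOneResult(listener);
            System.out.println("Received: " + new String(got, StandardCharsets.UTF_8));
        }
    }
}
```

In this sketch the blocking `read` call is where a receiving thread would sit if the sender's bytes never arrive.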
In my tests, there are 13 jobs dispatched to a number of servers. Within 2 seconds, all servers have finished processing their jobs and try to send the results back to W's socket.
I can see in the logs that 9 jobs are received by W (this number varies from test to test), so I go looking for the 4 remaining jobs. If I run netstat on the Windows box, I see that 4 sockets are open:
TCP W:4373 S5:48197 ESTABLISHED
TCP W:4373 S5:48198 ESTABLISHED
TCP W:4373 S6:57642 ESTABLISHED
TCP W:4373 S7:48295 ESTABLISHED
If I do a thread dump of W, I see 4 threads trying to read from these sockets, and apparently stuck in java.net.SocketInputStream.socketRead0(Native Method).
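A blocking read with no timeout waits indefinitely, which matches threads parked in socketRead0. One way to at least make the stall visible in logs is a read timeout via `Socket.setSoTimeout`. The following is a minimal sketch under the assumption that the receiving code can be changed; the class name, the `TIMED_OUT` marker, and the loopback demo are hypothetical and not part of the real dispatcher.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /** Marker returned when no data arrived within the timeout. */
    static final int TIMED_OUT = -2;

    /**
     * Performs one read with a timeout. Returns the number of bytes read,
     * -1 at end of stream, or TIMED_OUT if nothing arrived in time.
     */
    static int readOnce(Socket socket, byte[] buf, int timeoutMillis) {
        try {
            socket.setSoTimeout(timeoutMillis); // 0 would mean "block forever"
            return socket.getInputStream().read(buf);
        } catch (SocketTimeoutException e) {
            // The socket is still usable after a timeout; the caller can log and retry.
            return TIMED_OUT;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo: a peer connects but never sends anything, so the read times out. */
    static int demoSilentPeer(int timeoutMillis) {
        try (ServerSocket listener = new ServerSocket(0);
             Socket silentPeer = new Socket(InetAddress.getLoopbackAddress(), listener.getLocalPort());
             Socket accepted = listener.accept()) {
            return readOnce(accepted, new byte[1024], timeoutMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int result = demoSilentPeer(500);
        System.out.println(result == TIMED_OUT ? "read timed out" : "read returned " + result);
    }
}
```

A timeout does not fix whatever is holding the bytes back, but it turns a silent hang into a timestamped event that can be lined up with a packet capture or firewall log.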
If I go to each of the S boxes and run netstat, I see that some bytes are still in the Send-Q. The byte count does not change for 15 minutes. (The following is an aggregation of the netstat output from the different machines):
Proto Recv-Q Send-Q Local Address Foreign Addr State
tcp 0 6385 S1:48197 W:4373 ESTABLISHED
tcp 0 6005 S1:48198 W:4373 ESTABLISHED
tcp 0 6868 S6:57642 W:4373 ESTABLISHED
tcp 0 6787 S7:48295 W:4373 ESTABLISHED
If I do a thread dump of the servers, I see the threads are also stuck in java.net.SocketInputStream.socketRead0(Native Method). I would expect a write, but maybe they're waiting for an ACK? (Not sure here; would it show in Java? Shouldn't it be handled by the TCP protocol directly?)
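On the question of whether waiting for an ACK would show up in Java: as far as I understand it, `OutputStream.write` on a socket returns once the bytes have been handed to the operating system's send buffer, and acknowledgements and retransmissions are then handled by the kernel's TCP stack. That would be consistent with bytes sitting in Send-Q while the Java thread has already moved on to a read. The sketch below (hypothetical class and method names, loopback only) shows a write of roughly the size seen in Send-Q completing even though the receiving application never reads.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class WriteReturnsSketch {

    /**
     * Connects to a local listener whose application never reads, writes
     * payloadSize bytes, and returns true once write() and flush() have returned.
     * Assumes payloadSize fits in the OS socket buffers (a few KB does by default).
     */
    static boolean writeCompletesWithoutReader(int payloadSize) {
        try (ServerSocket listener = new ServerSocket(0);
             Socket sender = new Socket(InetAddress.getLoopbackAddress(), listener.getLocalPort());
             Socket neverRead = listener.accept()) {
            OutputStream out = sender.getOutputStream();
            out.write(new byte[payloadSize]); // copies into the kernel send buffer
            out.flush();
            return true; // reached without the peer application reading a single byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("write returned: " + writeCompletesWithoutReader(6000));
    }
}
```

On loopback the kernel acknowledges the data right away, so this only demonstrates that the Java call does not wait for the remote application; it does not reproduce the stuck Send-Q itself.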
Now, the very strange thing is: after 15 minutes (and it's always 15 minutes), the results are received, sockets are closed, and everything continues as normal.
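As a rough sanity check on the 15-minute figure, it may be worth comparing it with TCP's retransmission backoff on the Linux side. The sketch below assumes the Linux defaults documented for `tcp_retries2` (15 retransmissions, a minimum retransmission timeout of 200 ms, doubling up to a maximum of 120 s); none of these values have been confirmed on the S boxes, and the real timeout depends on the measured round-trip time.

```java
public class RetransmitBackoffSketch {

    /**
     * Sums the waits of an exponential backoff: the timeout starts at
     * initialRtoMillis, doubles after each retransmission, and is capped
     * at maxRtoMillis. Counts retries + 1 intervals (the last one being
     * the wait after the final retransmission).
     */
    static long totalTimeoutMillis(long initialRtoMillis, long maxRtoMillis, int retries) {
        long total = 0;
        long rto = initialRtoMillis;
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed Linux defaults: 200 ms minimum RTO, 120 s maximum RTO, tcp_retries2 = 15.
        long millis = totalTimeoutMillis(200, 120_000, 15);
        System.out.println(millis / 1000.0 + " seconds, about " + millis / 60000.0 + " minutes");
    }
}
```

Under those assumptions the total comes to 924.6 seconds, a little over 15 minutes. That is the point at which the kernel would normally give up on an unacknowledged connection, so it does not by itself explain why the results are delivered afterwards, but the coincidence suggests looking at retransmissions in the packet capture.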
This always used to work. The S servers have since moved to a different data center, so W and S are no longer in the same data center. Also, S is now behind a firewall. All ports should be authorized between S and W (I'm told). The mystery is really the 15-minute delay. I thought it could be some kind of protection against DDoS?
I'm no network expert so I asked for help, but nobody's available to help me. I spent 30 minutes with a guy capturing packets with Wireshark (formerly Ethereal), but for "security reasons," I cannot look at the result. He has to analyze this and get back to me. I asked for the firewall logs; same story.
I'm not root or administrator on these boxes, and now I don't know what to do. I'm not expecting a solution from you guys, but some ideas on how to make progress would be great!