views: 364
answers: 5

I have a Java program running on Windows (a Citrix machine) that dispatches requests to Java application servers on Linux; the dispatching mechanism is entirely custom.

The Windows Java program (let's call it W) opens a listening socket on a port assigned by the OS (say 1234) to receive results. It then invokes a "dispatch" service on the server with a "business request". That service splits the request, sends the pieces to other servers (call them S1 ... Sn), and synchronously returns the number of jobs to the client.
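For what it's worth, the receive path on W is roughly shaped like the sketch below (simplified; the class name, the length-prefixed wire format and the thread-per-connection layout are just illustrative, since the real code is custom):

    import java.io.DataInputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Illustrative sketch only; the real dispatch mechanism is custom.
    public class ResultListener {
        public static void main(String[] args) throws Exception {
            ServerSocket listener = new ServerSocket(0);   // port chosen by the OS (4373 in the netstat output below)
            int resultPort = listener.getLocalPort();      // sent along with the dispatch request
            System.out.println("Listening for results on port " + resultPort);
            while (true) {
                Socket s = listener.accept();              // one connection per job result
                new Thread(() -> {
                    try (DataInputStream in = new DataInputStream(s.getInputStream())) {
                        byte[] result = new byte[in.readInt()];
                        in.readFully(result);              // blocks in socketRead0; no SO_TIMEOUT is set
                        // hand the result to the business layer here
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }).start();
            }
        }
    }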

In my tests there are 13 jobs, dispatched to several servers. Within 2 seconds all the servers have finished processing their jobs and try to send their results back to W's socket.

I can see in the logs that 9 jobs are received by W (this number varies from test to test). So, I try to look for the 4 remaining jobs. If I do a netstat on this Windows box, I see that 4 sockets are open:

TCP    W:4373       S5:48197  ESTABLISHED
TCP    W:4373       S5:48198  ESTABLISHED
TCP    W:4373       S6:57642  ESTABLISHED
TCP    W:4373       S7:48295  ESTABLISHED

If I do a thread dump of W, I see 4 threads trying to read from these sockets, and apparently stuck in java.net.SocketInputStream.socketRead0(Native Method).

If I go onto each of the S boxes and run netstat, I see that some bytes are still sitting in the Send-Q, and that number does not move for 15 minutes. (The following is an aggregation of the netstat output from the different machines):

Proto Recv-Q Send-Q Local Address               Foreign Addr   State
tcp        0   6385 S5:48197                          W:4373   ESTABLISHED
tcp        0   6005 S5:48198                          W:4373   ESTABLISHED
tcp        0   6868 S6:57642                          W:4373   ESTABLISHED
tcp        0   6787 S7:48295                          W:4373   ESTABLISHED

If I do a thread dump of the servers, I see that those threads are also stuck in java.net.SocketInputStream.socketRead0(Native Method). I would have expected a write, but maybe they're waiting for an ACK? (I'm not sure here; would that even show up in Java? Shouldn't it be handled by the TCP stack directly?)

Now, the very strange thing is: after 15 minutes (and it's always 15 minutes), the results are received, sockets are closed, and everything continues as normal.

This always used to work. The S servers have since moved to a different data center, so W and S are no longer in the same data center, and S is now behind a firewall. All ports between S and W are supposed to be authorized (so I'm told). The real mystery is the 15-minute delay. Could it be some kind of protection against DDoS?

I'm no network expert, so I asked for help, but nobody is available. I spent 30 minutes with a guy capturing packets with Wireshark (formerly Ethereal), but for "security reasons" I'm not allowed to look at the capture; he has to analyze it and get back to me. I asked for the firewall logs; same story.

I'm not root or administrator on these boxes, so now I don't know what to do... I'm not expecting a full solution from you guys, but any ideas on how to make progress would be great!

+1  A: 

Are you missing a flush() on the S side after sending the response?

Peter
No: the same code is executed for the jobs that do arrive and it works fine. It also works in other environments, and it worked fine here in the past too. It's definitely a network issue.
Nicolas
+1  A: 

Right. If you're using a BufferedOutputStream, you need to call flush(); the buffer is only written out to the socket automatically once it fills up.
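Something along these lines (illustrative names only, obviously not your actual code):

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;

    // Illustrative sketch of a buffered send path on the S side.
    public class ResultSender {
        static void sendResult(String wHost, int wPort, byte[] result) throws IOException {
            try (Socket socket = new Socket(wHost, wPort);
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(socket.getOutputStream()))) {
                out.writeInt(result.length);
                out.write(result);   // may sit in BufferedOutputStream's 8 KB buffer
                out.flush();         // pushes it down to the socket; close() also flushes,
                                     // but an explicit flush() matters if the stream stays open
            }
        }
    }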

Nick
Flush is called.
Nicolas
+2  A: 

If it worked OK on your local network, then I don't see this being a programming issue (re: the flush() comments).

Is network connectivity between the two machines otherwise normal? Can you transfer similar quantities of data via (say) FTP without a problem? Can you reproduce the issue by knocking together a quick client/server pair that just sends appropriately sized chunks of data (see the sketch below)? In other words, is the network connectivity between W and S actually good?
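For example, a throwaway pair along these lines would show whether a plain transfer of a few kilobytes from S to W also stalls (only a sketch; the class name, port and payload size are arbitrary):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Throwaway connectivity check. Run "java NetCheck receive 4444" on W,
    // then "java NetCheck send <W-host> 4444" on one of the S boxes.
    // 7000 bytes is roughly the amount you see stuck in the Send-Q.
    public class NetCheck {
        public static void main(String[] args) throws Exception {
            if (args[0].equals("receive")) {
                try (ServerSocket ss = new ServerSocket(Integer.parseInt(args[1]));
                     Socket s = ss.accept();
                     InputStream in = s.getInputStream()) {
                    byte[] buf = new byte[1024];
                    long start = System.currentTimeMillis();
                    int total = 0, n;
                    while ((n = in.read(buf)) != -1) {
                        total += n;
                    }
                    System.out.println(total + " bytes received in "
                            + (System.currentTimeMillis() - start) + " ms");
                }
            } else {
                try (Socket s = new Socket(args[1], Integer.parseInt(args[2]));
                     OutputStream out = s.getOutputStream()) {
                    out.write(new byte[7000]);
                    out.flush();
                }
            }
        }
    }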

Another question: you now have a firewall in between. Could that be a bottleneck that wasn't there before? (Not sure how that would explain the consistent 15-minute delay, though.)

Final question: what are your TCP configuration parameters set to (on both W and S; I'm thinking of the OS-level settings)? Is there anything there that would suggest or lead to a 15-minute figure?

Not sure if that's any help.

Brian Agnew
+1  A: 

Apart from trying what Brian said, you could also check the following:

1) Run tcpdump on one of the servers and look at the sequence of messages from the time a job is initiated until after the delay, when all processing is complete. That will tell you which side is causing the delay (W or S). Check whether there are any retransmissions, missed ACKs, and so on.

2) Is there some kind of fragmentation happening between W and S?

3) What are the network load conditions on the servers where the bytes are stuck? Is heavy load causing output errors, so that the socket queues are not being emptied? (There could also be a NIC bug where, after hitting some error condition, the NIC buffers are not flushed or transmission fails to resume, and the condition is eventually cleared by some sort of watchdog.)

More information on the above would definitely help.

Harty
I don't have sufficient privileges to run tcpdump, but that's what I'm trying to do with the network guys. I'm not sure about the network conditions: on the subnets where W and S sit, the load is low, but there might be a bottleneck in a router the packets go through.
Nicolas
A: 

Are you sure the threads stuck in read calls are the same threads that were sending the data? Is it possible that the threads actually involved are blocked on some other activity, and your stack dump shows innocent threads that just happen to be doing socket I/O? It's been a while since I worked with Java, but I vaguely remember the JVM using sockets for IPC.

I would examine all the threads on the receiving side to see whether the intended receiver is actually off doing something else for 15 minutes.

The fact that it works in one location vs another usually points to an application timing error, not a datacenter problem.

Yes, I'm sure. The threads in question are in a dedicated thread pool and are named distinctively. I don't follow the logic that a timing issue points to the application, but I agree that 15 minutes is quite a lot for a network timeout.
Nicolas
Hmm, check for closed TCP receive windows; that would indicate an application problem. Sniff on each host physically to get the true end-to-end picture of the traffic. Break the Windows app in a debugger for 20 minutes to see whether the other servers finish up anyway.