tags:

views:

62

answers:

3

I'm working on a batch system (Torque), now the important part is this: When an interactive job is run, the submit tool will communicate with the execution host. This fails on one of the machines.

Here is the strace output:

Sumbit tool:

16:18:36.219925 fcntl(4, F_GETFL)       = 0x2 (flags O_RDWR)                                                                                                          
16:18:36.219925 read(4, "610.torque1.ics.muni.cz\0\0\0\0\0\0\0\0\0\0"..., 16385) = 1046                                                                               
16:18:36.219925 write(4, "TERM=xterm\0\0\0\0\0\0\220\5\377\377\377\177\0\0\214\303u\310\277\177\0\0\26"..., 80) = 80                                                  
16:18:36.219925 write(4, "\3\34\177\25\4\32"..., 6) = 6                                                                                                               
16:18:36.219925 write(4, "WINSIZE 46,166,0,0\0\0\0\0\0\0\1@\0\0\0\0\0\0\0"..., 80) = 80                                                                               
16:18:36.219925 write(1, "qsub: job 610.torque1.ics.muni.cz"..., 41qsub: job 610.torque1.ics.muni.cz ready  ) = 41
16:18:36.219925 rt_sigaction(SIGINT, {SIG_IGN}, NULL, 8) = 0
16:18:36.219925 rt_sigaction(SIGTERM, {SIG_IGN}, NULL, 8) = 0
16:18:36.219925 rt_sigaction(SIGALRM, {SIG_IGN}, NULL, 8) = 0
16:18:36.219925 rt_sigaction(SIGTSTP, {SIG_IGN}, NULL, 8) = 0
16:18:36.219925 clone(Process 20724 attached child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fbfc9a2d770) = 20724

Execution host part:

[pid  8778] 15:59:16.371145 getsockopt(3, SOL_SOCKET, SO_ERROR, [4294967296], [4]) = 0                                                                                
[pid  8778] 15:59:16.371145 fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)                                                                                       
[pid  8778] 15:59:16.371145 fcntl(3, F_SETFL, O_RDWR) = 0                                                                                                             
[pid  8778] 15:59:16.371145 write(3, "609.torque1.ics.muni.cz\0\0\0\0\0\0\0\0\0\0"..., 1046) = 1046                                                                   
[pid  8778] 15:59:16.371145 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)                                                                                                    
[pid  8778] 15:59:16.375144 read(3, 0x717ae0, 80) = ? ERESTARTSYS (To be restarted)                                                                                   
[pid  8778] 15:59:21.367024 --- SIGALRM (Alarm clock) @ 0 (0) ---                                                                                                     
[pid  8778] 15:59:21.367024 rt_sigreturn(0x8) = -1 EINTR (Interrupted system call)                                                                                    
[pid  8778] 15:59:21.367024 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0                                                          
[pid  8778] 15:59:21.367024 write(2, "pbs_mom: LOG_ERROR::Interrupted s"

Edit: These are actually separate runs, but the output is always exactly the same.

A: 

I am not sure but there seems to be something related here. Apparently,

It's a long standing bug reported regularly but so far nobody has tracked it down. That's mostly because most people cannot really reproduce it.

Something related to an exhaustion of resources that triggers a signal which gets passed to userland (should be captured at kernel level).

This might be related to superuser.com, however here's my best hint:

  • does it always trigger on the same machine?
  • what happens if you change the finishing order?
  • and if you try with a smaller window size? or bigger?

Also see here

lorenzog
finishing order? Yes it does always trigger on the machine, and only this one. Reboots don't change anything, system is fully updated (Debian Lenny on both sides). Thanks for the link, but modifying the kernel isn't something that I can do very easily.
Let_Me_Be
I understand, I'm just saying it might be some timing issues, i.e. the receiving end can't cope with the flow or can't close enough TCP ports in time to get the new data or has too many open file descriptors, etc. Also, do you have any extra patch? selinux?
lorenzog
A: 

Are you checking the return value from read? You should be. When it fails, check errno. If it is EINTR, you need to retry the read. (Or if it is one of the values in the links in lorenzog's answer.

Same thing for write, check the return value and errno.

You should also check for short reads/writes and handle this situation. (I.e. getting fewer bytes than you expected.)

bstpierre
The return values are checked, that's why there is the log write on the end of the strace output. Errno is not checked, the software only logs system errors. Retrying read is not a possibility (this is a huge software). I need to get rid the of the interrupt. Plus the interrupt itself seems to be caused by a timeout (SIGALRM), the real problem is that the read doesn't seem to work and stucks.
Let_Me_Be
+1  A: 

Since this appears to be a timeout on the recieving side of a socket, you could try setting the TCP_NODELAY socket option on the sending side.

caf
Unfortunately no, I can't do that. The timeouts need to stay as they are. The system needs to stay responsive even in case of network outages. Plus I'm trying to get rid of whatever is causing the execution host to timeout.
Let_Me_Be
Actually thanks for kicking me to the right direction. It is actually a chain timeout. Execution host is timing out when waiting for the submit tool, because the submit tools is waiting for the server. I still need to find why is the submit tool waiting for the server, but the mystery is resolved.
Let_Me_Be
@Let_Me_Be: `TCP_NODELAY` won't get rid of the timeouts - it just ensures that data is sent on the network as soon as its written.
caf