ansaurus

Question

Weird bug with interrupted system call that I can't debug

Answer 1

A:

I am not sure but there seems to be something related here. Apparently,

It's a long standing bug reported regularly but so far nobody has tracked it down. That's mostly because most people cannot really reproduce it.

It seems to be related to resource exhaustion that triggers a signal which gets passed to userland (it should be caught at kernel level).

This question might be better suited to superuser.com, but here are my best hints:

Does it always trigger on the same machine?
What happens if you change the finishing order?
What if you try a smaller window size? Or a bigger one?

Also see here

lorenzog 2010-09-14 15:12:45

Finishing order? Yes, it always triggers on this machine, and only this one. Reboots don't change anything, and the system is fully updated (Debian Lenny on both sides). Thanks for the link, but modifying the kernel isn't something that I can do very easily.

Let_Me_Be 2010-09-14 15:18:53

I understand, I'm just saying it might be a timing issue, i.e. the receiving end can't cope with the flow, or can't close enough TCP ports in time to get the new data, or has too many open file descriptors, etc. Also, do you have any extra patches? SELinux?

lorenzog 2010-09-14 17:07:11

Answer 2

A:

Are you checking the return value from read? You should be. When it fails, check errno. If it is EINTR, you need to retry the read. (Or if it is one of the values in the links in lorenzog's answer.)

Same thing for write: check the return value and errno.

You should also check for short reads/writes and handle this situation. (I.e. getting fewer bytes than you expected.)
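
The advice above can be sketched roughly as follows. This is a minimal illustration, not code from the software in question: `read_full` and `write_full` are hypothetical helper names, and the sketch assumes a blocking file descriptor.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes, retrying on EINTR and
 * continuing after short reads. Returns the number of bytes read (less
 * than `len` only at end of file), or -1 with errno set on a real error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t)n;   /* possibly a short read: keep going */
        } else if (n == 0) {
            break;               /* end of file */
        } else if (errno == EINTR) {
            continue;            /* interrupted by a signal: retry */
        } else {
            return -1;           /* real error, errno is set */
        }
    }
    return (ssize_t)done;
}

/* Hypothetical helper: write all `len` bytes, retrying on EINTR and
 * continuing after short writes. Returns `len`, or -1 on a real error. */
ssize_t write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n >= 0) {
            done += (size_t)n;   /* possibly a short write: keep going */
        } else if (errno == EINTR) {
            continue;            /* interrupted by a signal: retry */
        } else {
            return -1;           /* real error, errno is set */
        }
    }
    return (ssize_t)done;
}
```

Note that if the interrupting signal is itself a deliberate timeout, retrying unconditionally would defeat it; in that case the loop would need to check a flag set by the signal handler before retrying.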

bstpierre 2010-09-14 16:27:45

The return values are checked, that's why there is the log write at the end of the strace output. Errno is not checked, the software only logs system errors. Retrying the read is not a possibility (this is a huge piece of software). I need to get rid of the interrupt. Plus, the interrupt itself seems to be caused by a timeout (SIGALRM); the real problem is that the read doesn't seem to work and gets stuck.

Let_Me_Be 2010-09-14 16:43:09

Answer 3

+1 A:

Since this appears to be a timeout on the receiving side of a socket, you could try setting the TCP_NODELAY socket option on the sending side.
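
For reference, setting that option might look like the sketch below. `set_tcp_nodelay` is a hypothetical helper name, and `sockfd` is assumed to be a TCP socket owned by the sending program.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable Nagle's algorithm on a TCP socket so that
 * small writes are sent immediately instead of being coalesced.
 * Returns 0 on success, -1 with errno set on failure. */
int set_tcp_nodelay(int sockfd)
{
    int on = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

It would typically be called once, right after the socket is created or connected.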

caf 2010-09-15 03:25:48

Unfortunately no, I can't do that. The timeouts need to stay as they are. The system needs to stay responsive even in case of network outages. Plus, I'm trying to get rid of whatever is causing the execution host to time out.

Let_Me_Be 2010-09-15 09:37:00

Actually, thanks for pointing me in the right direction. It is a chained timeout: the execution host times out while waiting for the submit tool, because the submit tool is waiting for the server. I still need to find out why the submit tool is waiting for the server, but the mystery is solved.

Let_Me_Be 2010-09-15 12:15:11

@Let_Me_Be: `TCP_NODELAY` won't get rid of the timeouts - it just ensures that data is sent on the network as soon as it's written.

caf 2010-09-15 23:29:12
