I have a program which:

  • has a main thread (1) which starts a server thread (2) and another (4).
  • the server thread (2) does an accept(), then creates a new thread (3) to handle the connection.

At some point, thread (4) does a fork/exec to run another program, which should connect to the socket that thread (2) is listening on. Occasionally this fails or takes an unreasonably long time, and it's extremely difficult to diagnose. If I strace the system, it appears that the fork/exec has worked, the accept has happened, and the new thread (3) has been created ... but nothing happens in that thread (using strace -ff, the file for the relevant pid is blank).

Any ideas?

A: 

Reduce the code to the smallest possible size that still shows the behavior and post it here. Either you will find the answer yourself or we will be able to track it down.

BTW, according to http://lists.samba.org/archive/linux/2002-February/002171.html it seems that pthread behavior for exec is not well defined and may depend on your OS.

Do you have any code between fork and exec? This may be a problem.

agsamek
pjc50
Check out http://www.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html . Some code may be injected into fork even if none of your own code is there. Please prepare the smallest example you can.
agsamek
> Small testcase may be difficult as I can't reliably reproduce it with the huge application. --- pjc50: this is how we debug code. We don't guess; we track errors down. If you have two cases, a large one that produces the error and a small one that doesn't, then you are close to finding it. Just remove large chunks of code and check whether the error still exists. Take a look at http://en.wikipedia.org/wiki/Binary_search_algorithm , do your homework and let us know what the problem was.
agsamek
+1  A: 

It looks like a deadlock condition. Look for blocking functions, like accept(); the problem should be there.

A: 

Be very careful with multiple threads and fork. Most of glibc/libstdc++ is thread safe, which means it takes locks internally. If a thread other than the forking thread is holding a lock when the fork executes, the forked process inherits that mutex in its locked state. The thread that owns it does not exist in the new process, so the new process will never see the mutex unlocked. For more information see man pthread_atfork.
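
To make the pthread_atfork pattern concrete, here is a minimal sketch in C. The names app_lock, worker and fork_with_atfork_demo are made up for illustration and are not from the asker's program. It assumes a default (non-recursive) mutex and that the lock is one you own; locks hidden inside libraries cannot be handled this way, and locking a mutex in the child before exec is shown only to demonstrate that the lock is usable.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical application lock that a worker thread may hold when fork() runs. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t atfork_once = PTHREAD_ONCE_INIT;
static atomic_int stop_worker;

static void sleep_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* prepare handler: take the lock so no other thread owns it during fork(). */
static void prepare(void) { pthread_mutex_lock(&app_lock); }
/* parent and child handlers: release the lock taken in prepare(). */
static void release(void) { pthread_mutex_unlock(&app_lock); }

/* Register the handlers exactly once; registering twice would lock twice. */
static void register_handlers(void) { pthread_atfork(prepare, release, release); }

static void *worker(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_worker)) {
        pthread_mutex_lock(&app_lock);
        sleep_ms(1);                 /* pretend to work while holding the lock */
        pthread_mutex_unlock(&app_lock);
        sleep_ms(1);
    }
    return NULL;
}

/* Returns the child's exit status (0 on success), or -1 on error. */
int fork_with_atfork_demo(void) {
    pthread_t tid;
    int status = -1;

    pthread_once(&atfork_once, register_handlers);
    atomic_store(&stop_worker, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    sleep_ms(10);                    /* let the worker start cycling the lock */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the lock is free here because the child handler released it. */
        pthread_mutex_lock(&app_lock);
        pthread_mutex_unlock(&app_lock);
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) < 0)
        status = -1;

    atomic_store(&stop_worker, 1);
    pthread_join(tid, NULL);
    return (pid > 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Without the handlers, a fork() that lands while worker holds app_lock would leave the child blocked forever in pthread_mutex_lock, which matches the "nothing happens" symptom in the question.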

voxmea
+2  A: 

I came to the conclusion that it was probably this phenomenon:

http://kerneltrap.org/mailarchive/linux-kernel/2008/8/15/2950234/thread

as the bug is difficult to trigger on our development systems but is generally reported by users running on large shared machines; also the forked application starts a JVM, which itself allocates a lot of threads. The problem is also associated with the machine being loaded, and with extensive memory usage (we have a machine with 128 GB of RAM and processes may be 10-100 GB in size).

I've been reading the O'Reilly pthreads book, which explains pthread_atfork(), and suggests the use of a "surrogate parent" process forked from the main process at startup from which subprocesses are run. It also suggests the use of a pre-created thread pool. Both of these seem like good ideas, so I'm going to implement at least one of them.
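
For reference, the "surrogate parent" idea could look roughly like the following C sketch. This is my own illustration of the approach, not code from the book: the launcher struct, the launcher_* functions and the one-command-per-line pipe protocol are all assumptions. The helper is forked before any threads exist, stays single-threaded, and does every later fork/exec on behalf of the main process.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle for the helper process. */
typedef struct { pid_t pid; int req_fd; int reply_fd; } launcher;

/* Runs in the helper: read one command per line, run it, report its wait status. */
static void helper_loop(int req_fd, int reply_fd) {
    char cmd[256];
    for (;;) {
        size_t n = 0;
        char c;
        ssize_t r;
        while ((r = read(req_fd, &c, 1)) == 1 && c != '\n')
            if (n < sizeof cmd - 1)
                cmd[n++] = c;        /* overlong commands are truncated */
        if (r != 1)
            _exit(0);                /* EOF: the main process closed the pipe */
        cmd[n] = '\0';

        int status = -1;
        pid_t pid = fork();          /* safe: the helper has only one thread */
        if (pid == 0) {
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        if (pid > 0)
            waitpid(pid, &status, 0);
        if (write(reply_fd, &status, sizeof status) != (ssize_t)sizeof status)
            _exit(1);
    }
}

/* Call this at startup, BEFORE creating any threads. Returns 0 on success. */
int launcher_start(launcher *l) {
    int req[2], rep[2];
    if (pipe(req) < 0)
        return -1;
    if (pipe(rep) < 0) { close(req[0]); close(req[1]); return -1; }
    pid_t pid = fork();
    if (pid < 0) {
        close(req[0]); close(req[1]); close(rep[0]); close(rep[1]);
        return -1;
    }
    if (pid == 0) {
        close(req[1]); close(rep[0]);
        helper_loop(req[0], rep[1]);
    }
    close(req[0]); close(rep[1]);
    l->pid = pid; l->req_fd = req[1]; l->reply_fd = rep[0];
    return 0;
}

/* Ask the helper to run cmd; returns its exit code, or -1 on error. */
int launcher_run(launcher *l, const char *cmd) {
    int status;
    if (write(l->req_fd, cmd, strlen(cmd)) < 0 || write(l->req_fd, "\n", 1) < 0)
        return -1;
    if (read(l->reply_fd, &status, sizeof status) != (ssize_t)sizeof status)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Close the pipes and reap the helper; returns its exit code, or -1 on error. */
int launcher_stop(launcher *l) {
    int status;
    close(l->req_fd);
    close(l->reply_fd);
    if (waitpid(l->pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This sketch waits for each command to finish, which would not suit a long-running child such as a JVM; a real version would probably reply with the pid instead. launcher_run also needs a mutex around it if several threads call it.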

pjc50
A: 

I've just run into the same problem, and finally found that fork() duplicates all the threads. Now imagine what your program does after a fork() with a second instance of every thread running...

The following rules are from "A Mini-guide regarding fork() and Pthreads":

1- You DO NOT WANT to do that.

2- If you need to fork(), then whenever possible fork() all your children prior to starting any threads.

ern0
@ern0 This is not true. fork does not duplicate all the threads; it duplicates only the calling thread, as documented here: http://www.opengroup.org/onlinepubs/000095399/functions/fork.html
nos
I haven't tried it yet, but I will.
ern0
Tried and confirmed, fork() dupes the calling thread only.
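
One possible way to check this yourself, as a minimal sketch (not necessarily the test run above; the names ticks, ticker and fork_copies_only_caller are made up for illustration): a background thread bumps a counter, and after fork() the counter keeps advancing in the parent but stays frozen in the child, because the child contains only a copy of the thread that called fork().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_long ticks;        /* bumped by the background thread */
static atomic_int stop_ticker;

static void sleep_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void *ticker(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_ticker)) {
        atomic_fetch_add(&ticks, 1);
        sleep_ms(1);
    }
    return NULL;
}

/* Returns 1 if the counter changed during a 100 ms window, else 0. */
static int ticks_advance(void) {
    long before = atomic_load(&ticks);
    sleep_ms(100);
    return atomic_load(&ticks) != before;
}

/* Returns 1 if the ticker keeps running in the parent but not in the child,
 * 0 otherwise, and -1 on error. */
int fork_copies_only_caller(void) {
    pthread_t tid;
    int status = 0;

    atomic_store(&stop_ticker, 0);
    if (pthread_create(&tid, NULL, ticker, NULL) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0)
        _exit(ticks_advance());  /* child exits 0 if its counter stayed frozen */

    int parent_advances = ticks_advance();
    if (pid > 0)
        waitpid(pid, &status, 0);

    atomic_store(&stop_ticker, 1);
    pthread_join(tid, NULL);
    if (pid < 0 || !WIFEXITED(status))
        return -1;
    return parent_advances && WEXITSTATUS(status) == 0;
}
```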
ern0