Hi guys,

Various Perl scripts (Server Side Includes) on a website are calling a Perl module with many functions. EDIT: The scripts use use lib to reference the libraries from a folder. During busy periods the scripts (not the libraries) become zombies and overload the server.
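For illustration only (the real folder path and module name are different), each script pulls the shared module in roughly like this:

use strict;
use warnings;
use lib '/path/to/our/modules';   # hypothetical folder holding the shared libraries
use SiteFunctions;                # hypothetical name for the module of shared functions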

The server lists:

319 ?        Z      0:00 [scriptname1.pl] <defunct>    
320 ?        Z      0:00 [scriptname2.pl] <defunct>    
321 ?        Z      0:00 [scriptname3.pl] <defunct>

I have hundreds of instances of each.

EDIT: We are not using fork, system or exec, apart from the SSI directive

<!--#exec cgi="/cgi-bin/scriptname.pl"-->

As far as I know, in this case httpd itself will be the owner of the process. MaxRequestsPerChild is set to 0, which should not let the parent processes die before their child processes have finished.

So far we have found that temporarily suspending some of the scripts helps the server cope with the defunct processes and keeps it from falling over; however, zombie processes are undoubtedly still forming. gbacon seems to be closest to the truth with his theory that the server is simply unable to cope with the load.

What could lead to httpd abandoning these processes? Is there any best practice to prevent these from happening?

Thanks

Answer: The point goes to Rob. As he says, CGI scripts that generate SSI's will not have those SSI's handled. The evaluation of SSI's happens before the running of CGI's in the Apache 1.3 request cycle. This was fixed with Apache 2.0 and later so that CGI's can generate SSI commands.

Since we were running on Apache 1.3, every page view turned the SSI's into defunct processes. Although the server was trying to clear them, it was far too busy with its running tasks to succeed. As a result, the server fell over and became unresponsive. As a short-term solution we reviewed all SSI's and moved some of the processing to the client side to free up server resources and give it time to clean up. Later we upgraded to Apache 2.2.

+7  A: 

More Band-Aid than best practice, but sometimes you can get away with simple

$SIG{CHLD} = "IGNORE";

According to the perlipc documentation

On most Unix platforms, the CHLD (sometimes also known as CLD) signal has special behavior with respect to a value of 'IGNORE'. Setting $SIG{CHLD} to 'IGNORE' on such a platform has the effect of not creating zombie processes when the parent process fails to wait() on its child processes (i.e., child processes are automatically reaped). Calling wait() with $SIG{CHLD} set to 'IGNORE' usually returns -1 on such platforms.

If you care about the exit statuses of child processes, you need to collect them (commonly referred to as "reaping") by calling wait or waitpid. Despite the creepy name, a zombie is merely a child process that has exited but whose status has not yet been reaped.
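If the parent is your own code, a common idiom is a CHLD handler that reaps in a non-blocking loop. This is only a sketch (the handler name is arbitrary), but it shows the shape of it:

use POSIX ":sys_wait_h";            # provides WNOHANG

sub reap_children {                 # arbitrary name for the handler
    # Collect every child that has exited, without blocking.
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        my $status = $? >> 8;       # exit status of the reaped child
        warn "reaped child $pid with status $status\n";
    }
}
$SIG{CHLD} = \&reap_children;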

If your Perl programs themselves are the child processes becoming zombies, that means their parents (the ones that are forking-and-forgetting your code) need to clean up after themselves. A process cannot stop itself from becoming a zombie.

Greg Bacon
Thanks very much G. I'm not saying I understand how it works, but I will read more about it. I assume CHLD goes into the calling script. Is that right?
G Berdal
waitpid(-1, WNOHANG) won't block, so it can be called periodically to collect child exit status. Use a loop like this to reap all your zombies: while (($pid = waitpid(-1, WNOHANG)) > 0) ...
Ken Fox
@G Berdal How are your scripts being started? Do you control that code?
Greg Bacon
@gbacon They are server side includes on pages of a website. Most of them call a library for functions. The funny thing is that the SSI scripts are the ones becoming zombies according to the logs.
G Berdal
Which web server are you running? (I assume apache, which I would expect to reap children's exit statuses correctly!) Please provide representative samples of the log messages you're seeing related to zombies. You might want to edit your question to include that information. The log messages will be much more readable there than in a comment.
Greg Bacon
I'm using apache 1.3. - I have added the logs.
G Berdal
You wrote that this happens under heavy load. When traffic backs off, does apache finally reap the zombies, or do they hang around until you restart apache? If the former, the machine could be too busy serving requests to get around to reaping. The latter is likely a bug somewhere out of your control. Either way, the kind folks over at Server Fault will probably be more help to you, and we could move your question there if you'd like.
Greg Bacon
+1 That is a very good question. I will ask the Administrator to check that out for me. I think it is a bit premature to say that it is solely a server issue. I wish it was, then I could simply pass it on to the Administrator. :)
G Berdal
A: 

As you have all the bits yourself, I'd suggest running the individual scripts one at a time from the command line to see if you can spot the ones that are hanging.

Does a ps listing show an inordinate number of instances of one particular script running?

Are you running the CGI's using mod_perl?

Edit: Just saw your comments regarding SSI's. Don't forget that SSI directives can run Perl scripts themselves. Have a look to see what the CGI's are trying to run.

Are they dependent on yet another server or service?

Rob Wells
I have spotted the ones that are hanging. Nearly all of the SSI's on the webpage become multiple zombie instances. They are calling a library for functions. What I am not sure about is who counts as the parent for these.
G Berdal
The process which called the SSI or CGI is its parent. You could try using `ps` to look up the ppid (parent process id) and then seeing what that process is, but I'm not positive offhand whether `ps` will return a ppid for zombies. (Seems like it should, since the zombie has to know who's supposed to reap it, I just haven't verified that it does work.) For me, the output of `ps -l` includes ppid; check your local man page if your `ps` behaves differently.
Dave Sherohman
@Dave, when a process is a zombie it has no ppid by definition.
Rob Wells
@Dave, my bad. An orphan process has no ppid by definition. A zombie process has exited but has not yet been reaped.
Rob Wells
+2  A: 

I just saw your comment that you are running Apache 1.3 and that may be associated with your problem.

SSI's can run CGI's. But CGI scripts that generate SSI's will not have those SSI's handled. The evaluation of SSI's happens before the running of CGI's in the Apache 1.3 request cycle. This was fixed with Apache 2.0 and later so that CGI's can generate SSI commands.

As I'd suggested above, try running your scripts on their own and have a look at the output. Are they generating SSI's?

Edit: Have you tried launching a trivial Perl CGI script to simply printout a Hello World type HTTP response?
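Something along these lines is enough (only a sketch; adjust the interpreter path for your system and make the file executable):

#!/usr/bin/perl
use strict;
use warnings;

# Minimal CGI: print a valid header, then a trivial body.
print "Content-type: text/html\n\n";
print "<html><body>Hello World</body></html>\n";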

Then if this works add a trivial SSI directive such as

<!--#printenv -->

and see what happens.

Edit 2: Just realised what is probably happening. Zombies occur when a child process exits and isn't reaped. These processes are hanging around and slowly using up resources within the process table. A process without a parent is an orphaned process.

Are you forking off processes within your Perl script? If so, have you added a waitpid() call to the parent?
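If so, a pattern like the following in the parent keeps the children from being left unreaped (just a sketch; do_child_work() stands in for whatever the child really does):

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: do the work and exit explicitly.
    do_child_work();   # hypothetical placeholder for the child's task
    exit 0;
}

# Parent: wait for the child so it never lingers as a zombie.
waitpid($pid, 0);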

Have you also got the correct exit within the script?

CORE::exit(0);
Rob Wells
I ran all scripts through the debugger and eliminated all errors and warnings. They are generating output properly. We were about to upgrade to 2.0 anyway. Do you think that would help?
G Berdal
Ok. Good work with the debugger to eliminate all errors and warnings. Are any of your Perl CGIs running successfully to completion at all?
Rob Wells
As I said they are running just fine. Apart from the fact that they become zombies they are running perfectly.
G Berdal
@George, they are not running fine if you are getting zombie processes. These processes will hang around and slowly consume the resources of your server.
Rob Wells
@Rob, could using POSIX::_exit(0); help? According to http://perldoc.perl.org/functions/exit.html that avoids END routines and destruction processing. I'm thinking gbacon might be right and the server simply doesn't have time to do garbage collection during busy periods...
G Berdal