views:

89

answers:

0

I've seen this question posted a few times using a Google search, with no real answers. I have a multi-threaded FastCGI application running with Apache 2.2 on FreeBSD 7.2. There are a few issues with it, and I am unable to really figure out the source of the problem even after poking through a bunch of the mod_fastcgi source code.

My FastCGI application gets anywhere from 2 to 15 or so hits per second, and mostly services a back-end API (the majority of web server usage is for this, and not actually serving content). Everything seems to work ok under normal conditions, but recently this problem has been becoming worse.

It starts out with the FastCGI process manager apparently trying to close unneeded processes, sending them a SIGTERM signal. I catch the signal, clean up some stuff, and exit (by calling exit()) with status code 0. This process seems to result in three log messages in my httpd error log:

[Tue Jun 01 14:03:31 2010] [warn] FastCGI: (dynamic) server "/home/program/wwwroot/domains/www.mydomain.com/cgi-bin/program.cgi" (pid 98182) termination signaled
[Tue Jun 01 14:03:31 2010] [warn] FastCGI: (dynamic) server "/home/program/wwwroot/domains/www.mydomain.com/cgi-bin/program.cgi" (pid 98182) terminated by calling exit with status '0'
[Tue Jun 01 14:03:31 2010] [warn] FastCGI: (dynamic) server "/home/program/wwwroot/domains/www.mydomain.com/cgi-bin/program.cgi" restarted (pid 98294)

I am not sure why it says it is restarting the process, but in any case no core dump is ever generated so I do believe it is the FastCGI process manager doing it's thing. This makes sense because it begins to happen after the initial load increase from restarting Apache. Since it's down for a few seconds, it gets hit with a couple of hundred requests over the first few seconds it's running again (sometimes even hitting the upper limit of MAXCLIENTS in Apache), and this seems to be the process manager doing the work of spawning more processes to handle the increased load.

So this all seems fine, but here is where things get weird. There are really two problems that I see.

First, my multithreaded FastCGI process spawns 25 worker threads, and all seem to be used according to my internal log files (multiple processes are clearly using multiple threads to do work). However it seems that 3 or 4 FastCGI processes is not enough to handle the 5 to 15 hit per second load, even though the requests take about .02s or so to process internally. In order to be at all responsive, it seems I need 50 or more FastCGI processes, leading me to believe that FastCGI does not realize that my program is multithreaded. I've read the documentation at http://www.fastcgi.com/mod_fastcgi/docs/mod_fastcgi.html and do not see any option pertaining to multithreaded-ness, and my internal code is more or less set up just like the examples provided by the FastCGI library.

The second problem I am having is that once process termination has happened a bunch of times as above (and seemingly at random), I begin getting a lot of these messages in my error log:

[Tue Jun 01 14:06:22 2010] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)

The messages occur for about half the hits I get to the server, and it completely kills the responsiveness of my application - it seems FastCGI will look for a working "pipe" until it finds one, and fail to realize that whatever application it is trying to contact is dead. It does still work though, it's just incredibly unresponsive - sometimes taking up to 40 or so seconds to process a request.

I recompiled mod_fastcgi with some extra debugging around the point of the error message, and it appears that the error happens when it tries to write() to the application. The call to write() fails with a -1 return code, and sets errno to EPIPE.

I am noticing that the issue happens mostly when either a crash occurs in one of the FastCGI processes, or a bunch of them are seemingly terminated by the process manager. I haven't had any core dumps though, except for one, where the backtrace outputted by gdb is just a single call to free() at address 0x0000000000000000 with nothing else in the stack trace, so I don't really know what to make of that. I'm thinking it happens sometime after the SIGTERM signal is caught, maybe some global variable not being cleaned up properly or something.