Our setup is a standard nginx (version 0.7.59) + thin stack on Debian lenny. Right now we're on one beefy box for web/app and one box for the database. Recently we started noticing that some thins eventually "hang", i.e. they stop receiving requests from nginx. We run 15 thins, and within 10-15 minutes the first one or two are hung; left all day, those same few plus a few more stay hung. The only fix we've found so far is restarting nginx: after a restart, the hung thins immediately start receiving requests again. Because of this, it looks like nginx has taken those thins out of the upstream pool.
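For context, our upstream configuration is roughly the following sketch (names and ports are illustrative; the real file lists all 15 thins):

    # Sketch of our upstream pool (illustrative ports; one entry per thin)
    upstream thin_cluster {
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
        # ... up through 127.0.0.1:3014 for all 15 thins
    }

    server {
        listen 80;
        location / {
            proxy_pass http://thin_cluster;
        }
    }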
If I understand the docs (http://wiki.nginx.org/NginxHttpUpstreamModule#server) correctly, with the defaults (which we're using), if nginx fails to communicate with a backend server 3 times within 10 seconds, it marks that upstream server "inoperative", waits 10 seconds, and then tries that server again. That makes sense, but we're seeing thins hang indefinitely. I tried setting max_fails to 0 for each of the thins, but that didn't help. I can't figure out what would cause an upstream server to become permanently "inoperative".
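Concretely, the change I tried looked like this (a sketch with illustrative ports; per the docs, max_fails=0 should disable failure accounting, so the server should never be marked down):

    # Attempted workaround: disable failure accounting per upstream server
    upstream thin_cluster {
        server 127.0.0.1:3000 max_fails=0;
        server 127.0.0.1:3001 max_fails=0;
        # ... same for the rest of the pool
    }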
We've seen significant traffic growth recently, so we're not sure whether that's related, or whether the problem was always there and more traffic in a shorter period just makes it surface faster.
Is there something else (a changeable directive or some other condition) in nginx that would cause it to take a server completely out of the pool?