I'm trying to get my Django based webapp into a working deployment configuration, and after spending a bunch of time trying to get it working under lighttpd / fastcgi, can't get past this problem. When a client logs in for the first time, they receive a large data dump from the server, which is broken into several ~1MB size chunks that are sent back as JSON.
Every so often, the client will receive a truncated response for one of the chunks, I will see this message in the lighttpd logs:
2010-09-14 23:25:01: (mod_fastcgi.c.2582) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:8000
2010-09-14 23:25:01: (mod_fastcgi.c.3382) response already sent out, but backend returned error on socket: tcp:127.0.0.1:8000 for /myapp.fcgi?, terminating connection
I'm really pulling my hair out trying to figure out why this happens (which doesn't happen when running Django in ./manage.py runserver
mode). The following are things I've tried that have had no effect:
Reducing the chunk size from 1MB to 256K. Even though the truncation usually happens at around the 600K - 900K mark, I still got truncations under the 256K chunk size.
Setting the
minspare
andmaxchildren
values on Django'srunfgci
really high so that there will be lots of spare threads hanging around.Setting
maxchildren
to 1 so that there is only one thread.Switching between UNIX socket mode and TCP/IP mode for the fastcgi connection between lighttpd and Django.
I've Googled a lot for this stuff but couldn't find anything that seemed to be a fix for Django (any help seemed to be around tweaking PHP settings).
My setup is:
OSX 10.6.4
Python 2.6.1 (system)
lighttpd installed from Macports (1.4.26_1+ssl)
flup installed from latest Python egg on flup website (tried both 1.0.2 stable and latest 1.0.3 devel)
Django 1.2.1 installed from tarball on Django website
The FastCGI block in my lighttpd config is:
fastcgi.server = ("/myapp.fcgi" =>
("django" =>
(
#"socket" => lighttpd_base + "fcgi.sock",
"host" => "127.0.0.1",
"port" => 8000,
"check-local" => "disable",
"max-procs" => 1,
"debug" => 1
)
)
)
The runfcgi
command I'm using to start Django is currently:
./manage.py runfcgi daemonize=false debug=true host=127.0.0.1 port=8000
method=threaded maxchildren=1
If anyone has any insight into how to stop this from happening, the help would be much appreciated. If I can't solve this relatively quickly I will have to abandon lighttpd + fastcgi and look at Apache + mod_wsgi or perhaps nginx + fastcgi, and the prospect of going into another webserver config is not something I'm looking forward to ...
Thanks in advance for any help.
Edit: Additional Info
I found this page on the lighty forums indicating that it could be Django's fault ... in that case it was that PHP was crashing. I checked my Django-side stuff and discovered that even after a truncation, the Python thread that sent the truncated response would still be running afterwards and would serve subsequent requests, so it looks like the stream is not being broken by the thread hitting an exception and crashing out.
I wanted to figure out whether or not it was Django's fcgi impl or Lighttpd that was at fault here (because that will determine whether or not moving to nginx + fastcgi would actually solve anything), so I took a look at the packet trace in Wireshark. The simplified log of what happens just before a truncation is below:
No. Time Info
30082 233.411743 django > lighttpd [PSH, ACK] Seq=860241 Ack=869 Win=524280 Len=8184 TSV=417114153 TSER=417114153
30083 233.411749 lighttpd > django [ACK] Seq=869 Ack=868425 Win=524280 Len=0 TSV=417114153 TSER=417114153
30084 233.412235 django > lighttpd [PSH, ACK] Seq=868425 Ack=869 Win=524280 Len=8 TSV=417114153 TSER=417114153
30085 233.412250 lighttpd > django [ACK] Seq=869 Ack=868433 Win=524280 Len=0 TSV=417114153 TSER=417114153
30086 233.412615 django > lighttpd [PSH, ACK] Seq=868433 Ack=869 Win=524280 Len=8184 TSV=417114153 TSER=417114153
30087 233.412628 lighttpd > django [ACK] Seq=869 Ack=876617 Win=524280 Len=0 TSV=417114153 TSER=417114153
30088 233.412723 lighttpd > django [FIN, ACK] Seq=869 Ack=876617 Win=524280 Len=0 TSV=417114153 TSER=417114153
30089 233.412734 django > lighttpd [ACK] Seq=876617 Ack=870 Win=524280 Len=0 TSV=417114153 TSER=417114153
30090 233.412740 [TCP Dup ACK 30088#1] lighttpd > django [ACK] Seq=870 Ack=876617 Win=524280 Len=0 TSV=417114153 TSER=417114153
30091 233.413051 django > lighttpd [PSH, ACK] Seq=876617 Ack=870 Win=524280 Len=8 TSV=417114153 TSER=417114153
30092 233.413070 lighttpd > django [RST] Seq=870 Win=0 Len=0
Good packets are coming from Django at the start (30082 for 8184 bytes, and then again at 30086 for another 8184 bytes) and then at entry 30088 for some reason Lighttpd sends a TCP FIN
to Django which is presumably what causes the connection to terminate and that's how you get the truncation.
On the face of it, it seems like this is Lighttpd's fault, since it looks like it is shutting things down before it's supposed to ... although I'm not sure that it isn't doing this because it has received some bad data from Django to which it reacts by shutting down.