We have a server with 10 running mongrel_cluster instances with apache in front of them, and every now and then one or some of them hang. No activity is seen in the database (we're using activerecord sessions). Mysql with innodb tables. show innodb status shows no locks. show processlist shows nothing.
The server is linux debian 4.0
Ruby is: ruby 1.8.6 (2008-03-03 patchlevel 114) [i486-linux]
Rails is: Rails 1.1.2 (yes, quite old)
We're using the native mysql connector (gem install mysql)
"strace -p PID" gives the following in a loop for the hung mongrel process:
gettimeofday({1219834026, 235289}, NULL) = 0
select(4, [3], [0], [], {0, 905241}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235477}, NULL) = 0
select(4, [3], [0], [], {0, 905053}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235654}, NULL) = 0
select(4, [3], [0], [], {0, 904875}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235829}, NULL) = 0
select(4, [3], [0], [], {0, 904700}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236017}, NULL) = 0
select(4, [3], [0], [], {0, 904513}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236192}, NULL) = 0
select(4, [3], [0], [], {0, 904338}) = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236367}, NULL) = 0
...
I used lsof and found that the process used 67 file descriptors (lsof -p PID |wc -l)
Is there any other way I can debug this, so that I could for example determine which file descriptor is "bad"? Any other info or suggestions? Anybody else seen this?
The site is fairly used, but not overly so, load averages usually around 0.3.
Some additional info. I installed mongrelproctitle to show what the hung processes were doing, and it seems they are hanging on a method that displays images using file_column / images from the database / rmagick to resize and make the images greyscale.
Not conclusive the problem is here, but it is a suspicion. Is there something obviously wrong with the following? The method displays a static image if the order doesn't contain an image, else the image resized from the order. The cache stuff is so that the image gets updated in the browser every time. The image is inserted in the page with a normal image tag.
code:
def preview_image
@order = session[:order]
if @order.image.nil?
@headers['Pragma'] = 'no-cache'
@headers['Cache-Control'] = 'no-cache, must-revalidate'
send_data(EMPTY_PIC.to_blob, :filename => "img.jpg", :type =>
"image/jpeg", :disposition => "inline")
else
@pic = Image.read(@order.image)[0]
if (@order.crop)
@pic.crop!(@order.crop[:x1].to_i, @order.crop[:y1].to_i,
@order.crop[:width].to_i, @order.crop[:height].to_i, true)
end
@pic.resize!(103,130)
@pic = @pic.quantize(256, Magick::GRAYColorspace)
@headers['Pragma'] = 'no-cache'
@headers['Cache-Control'] = 'no-cache, must-revalidate'
send_data(@pic.to_blob, :filename => "img.jpg", :type =>
"image/jpeg", :disposition => "inline")
end
end
Here is the lsof output if anybody can find any problems in it. Don't know how it will format in this message...
lsof: WARNING: can't stat() ext3 file system /dev/.static/dev
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
mongrel_r 11628 username cwd DIR 9,2 4096 1870688
/home/domains/example.com/usernameOrder/releases/20080831121802
mongrel_r 11628 username rtd DIR 9,1 4096 2 /
mongrel_r 11628 username txt REG 9,1 3564 167172
/usr/bin/ruby1.8
mongrel_r 11628 username mem REG 0,0 0
[heap] (stat: No such file or directory)
mongrel_r 11628 username DEL REG 0,8 15560245
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560242
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560602
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560601
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560684
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560683
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560685
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560568
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560607
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560569
/dev/zero
mongrel_r 11628 username mem REG 9,1 1933648 456972
/usr/lib/libmysqlclient.so.15.0.0
mongrel_r 11628 username DEL REG 0,8 15442414
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560546
/dev/zero
mongrel_r 11628 username mem REG 9,1 67408 457393
/lib/i686/cmov/libresolv-2.7.so
mongrel_r 11628 username mem REG 9,1 17884 457386
/lib/i686/cmov/libnss_dns-2.7.so
mongrel_r 11628 username DEL REG 0,8 15560541
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560246
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560693
/dev/zero
mongrel_r 11628 username DEL REG 0,8 15560608
/dev/zero
mongrel_r 11628 username mem REG 9,1 25700 164963
/usr/lib/gconv/gconv-modules.cache
mongrel_r 11628 username mem REG 9,1 83708 457384
/lib/i686/cmov/libnsl-2.7.so
mongrel_r 11628 username mem REG 9,1 140602 506903
/var/lib/gems/1.8/gems/mysql-2.7/lib/mysql.so
mongrel_r 11628 username mem REG 9,1 1282816 180935
...
mongrel_r 11628 username 1w REG 9,2 462923 1575329
/home/domains/example.com/usernameOrder/shared/log/mongrel.8001.log
mongrel_r 11628 username 2w REG 9,2 462923 1575329
/home/domains/example.com/usernameOrder/shared/log/mongrel.8001.log
mongrel_r 11628 username 3u IPv4 15442350 TCP
localhost:8001 (LISTEN)
mongrel_r 11628 username 4w REG 9,2 118943548 1575355
/home/domains/example.com/usernameOrder/shared/log/production.log
mongrel_r 11628 username 5u REG 9,1 145306 234226
/tmp/mongrel.11628.0 (deleted)
mongrel_r 11628 username 7u unix 0xc3c12480 15442417
socket
mongrel_r 11628 username 11u REG 9,1 50 234180
/tmp/CGI.11628.2
mongrel_r 11628 username 12u REG 9,1 26228 234227
/tmp/CGI.11628.3
I have installed monit to monitor the server. No automatic restarts yet because of the PID file issue, but maybe I will get the newest version which supports deleting stale PID-files.
It would be nice though to actually fix the problem, because somebody will get disconnects etc if the server need to be restarted all the time (~10 times a day)
The mongrel-processes don't take any large amount of memory when this is happening, and the machine isn't even swapping, so it's probably not a memory leak.
total used free shared buffers cached
Mem: 4152796 4083000 69796 0 616624 2613364
-/+ buffers/cache: 853012 3299784
Swap: 1999992 52 1999940