This problem appeared today and I have no idea what is going on. Please share your ideas.

I have 1 EC2 DB server (MySQL + NFS file sharing + Memcached).

And I have 3 EC2 web servers (lighttpd), each of which mounts the NFS folders from the DB server.

Everything had been running smoothly for months, but suddenly an interesting phenomenon appeared.

Every 8 to 10 minutes, PHP files become unreachable. This lasts about 1 minute and then everything goes back to normal. Static files such as .html files are unaffected. All servers have the same problem at exactly the same time.

I spent a whole day analyzing the cause. Finally, I found that when the problem appears, the number of file descriptors held by lighttpd suddenly increases a lot.


I used ls /proc/1234/fd | wc -l to check the number of file descriptors.

The number of file descriptors is around 250 in normal operation. However, when the problem appears, it rises to 1500 and then drops back to normal.
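A total count alone does not say what the extra descriptors are. As a rough sketch (Linux only, and assuming the script runs as root or as the same user as lighttpd), the same /proc/<pid>/fd directory can be read from Python and grouped by link target, which shows whether a spike is made of sockets (connections piling up) or regular files (for example files on the NFS mount). The function names `classify_fds` and `watch` and the `proc_root` parameter are my own, not part of any tool.

```python
import os
import time
from collections import Counter


def classify_fds(pid, proc_root="/proc"):
    """Count a process's open descriptors, grouped by what they point at."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            # Anything else is a path: regular files, devices, NFS files.
            kinds["file"] += 1
    return kinds


def watch(pid, interval=5.0):
    """Print a timestamped breakdown every `interval` seconds until interrupted."""
    while True:
        kinds = classify_fds(pid)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, "total=%d" % sum(kinds.values()), dict(kinds))
        time.sleep(interval)
```

Running `watch(1234)` with the real lighttpd PID and leaving it going across one of the one-minute outages would give timestamps that can be lined up with the lighttpd logs.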

It sounds funny, right? Do you have any idea what's going on?

[Image: the CPU graph of one of the web servers]

A: 

Thoughts:

  • Have a look at dmesg output.
  • The number of file descriptors jumping up sounds to me like something is blocking, including the processing of connections to lighttpd/PHP, which build up until the blocking condition ends.
  • When you say the PHP file is unreachable, do you mean the file is missing? Or does the PHP script stall during execution? What do the lighttpd log files say is happening on the calls to this PHP script? Are there any other hints in the lighttpd logs?
  • What is the maximum number of file descriptors for the process/user?
  • I and others have had bizarre networking behavior on EC2 instances from time to time. Give us more details on it. Maybe set up some additional monitoring of the connectivity between your instances. Consider moving your problem instance to another instance in the hopes of the problem magically disappearing. (Shot in the dark.)
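Two of the checks above are easy to script. This is only a sketch with helper names of my own (`max_open_files`, `tcp_connect_ms`): the first reads the soft and hard "Max open files" values out of the text of /proc/<pid>/limits, and the second times a single TCP connect from a web server to a service on the DB server.

```python
import re
import socket
import time


def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' from the text of /proc/<pid>/limits.

    Values are returned as strings because either one may be 'unlimited'.
    """
    match = re.search(
        r"^Max open files\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Max open files' line found")
    return match.group(1), match.group(2)


def tcp_connect_ms(host, port, timeout=3.0):
    """Time one TCP connect; return milliseconds, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None
```

For example, `max_open_files(open("/proc/1234/limits").read())` shows how close a peak of 1500 descriptors is to the limit, and calling `tcp_connect_ms("db-host", 2049)` in a loop (with `db-host` standing in for the DB server's address, and 2049, 3306 and 11211 being the default NFS, MySQL and Memcached ports) would show whether connectivity to the DB server degrades at the same moments the PHP requests stall.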

And finally...

  • DoS attack? I doubt it--it would be offline or not. It is way too early in the debugging process for you to infer malice on someone else's part.
Stu Thompson
Well, the problem has gone away on its own. I didn't change anything. Maybe it's an EC2 problem.
Yeah, OK... then this very much sounds like one of those magical EC2 burps... good luck!
Stu Thompson