views: 1306
answers: 5

We run a medium-size site that gets a few hundred thousand pageviews a day. Up until last weekend we ran with a load usually below 0.2 on a virtual machine. The OS is Ubuntu.

When deploying the latest version of our application, we also ran an apt-get dist-upgrade beforehand. After the deployment we noticed that the CPU load had spiked dramatically (sometimes reaching 10, at which point the server stopped responding to page requests).

We tried dumping a full minute of Xdebug profiling data from PHP. Looking through it revealed a few somewhat slow parts, but nothing that would explain the huge jump.
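
(For reference, we enabled the profiler roughly like this; the conf.d path and the output directory are just examples from our setup, assuming Xdebug 2:)

# drop the profiler settings into a PHP ini fragment, collect for a minute, then remove it again
cat > /etc/php5/apache2/conf.d/xdebug-profiler.ini <<'EOF'
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = /tmp/xdebug
EOF
/etc/init.d/apache2 reload
# the resulting cachegrind.out.* files can be opened in KCachegrind or webgrind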

We are now fairly sure that nothing in the new version of our website is triggering the problem, but we have no way to confirm it. We have rolled back a lot of the changes, but the problem persists.

When we look at the process list, we see individual Apache processes using quite a bit of CPU for longer than seems necessary. However, when running strace on an affected process, we never see anything but

accept(3,

and it hangs for a while before receiving a new connection, so we can't actually see what is causing the problem.
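
(We attach to a child roughly like this, where 12345 stands for the child's PID:)

strace -f -tt -s 128 -p 12345 -o /tmp/apache-child.strace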

The stack is PHP 5, Apache 2 (prefork), MySQL 5.1. Most things run through Memcached. We've tried APC and eAccelerator.

So, what should be our next step? Are there any profiling methods we overlooked/don't know about?

+1  A: 

Perhaps you were using the worker MPM before and now you are not?

I know PHP5 does not work with the worker MPM. On my Ubuntu server, PHP5 can only be installed with the prefork MPM. It seems the PHP5 module is not compatible with the multithreaded version of Apache.

I found a link here that shows how to get better performance with mod_fcgid.

To see what the worker MPM is, see here.
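
To double-check which MPM is actually in use, something like this should tell you (on Ubuntu the MPM is a separate apache2-mpm-* package):

apache2ctl -V | grep -i mpm
dpkg -l 'apache2-mpm-*'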

Paul Whelan
Apache is still running using prefork. PHP is working fine.
Vegard Larsen
Out of ideas then, I'm afraid. I thought you might have been using PHP4 in the old version of the application, and that since the upgrade to PHP5 Apache is now running in prefork mode. Was your old version of the application using PHP4?
Paul Whelan
Maybe about a month old. We do upgrades before every deployment. We might stop doing that after this problem, though... :)
Vegard Larsen
+1  A: 

I'd use DTrace to solve this mystery... if it were running on Solaris or a Mac... but since Linux doesn't have it, you might want to try SystemTap. However, I can't say anything about its usability since I haven't used it.

With DTrace you could easily sniff out the culprits within a day, and I would hope it's similar with SystemTap.
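
I haven't used SystemTap myself either, but I believe a one-liner along these lines would count syscalls per process for 30 seconds; treat it as an untested sketch (it needs root and the kernel debug symbols installed):

stap -e 'global c
probe syscall.* { c[execname(), name]++ }
probe timer.s(30) {
  foreach ([proc, call] in c- limit 20)
    printf("%8d  %-16s %s\n", c[proc, call], proc, call)
  exit()
}'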

Robert Gould
SystemTap seems a bit too complicated for now.
Vegard Larsen
A: 

Another option (I can't assure you it will do any good, but it's more than worth the effort) is to read the detailed changelog for the new version and review what might have changed that could remotely affect you.

Going through the changelogs has saved me more than once, especially when config options have changed or something got deprecated. Worst case, it'll give you some clues as to where to look next.
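
On Ubuntu the relevant information is already on disk, so something like this gets you started (apache2 is just an example package):

# what did the dist-upgrade actually touch?
grep " upgrade " /var/log/dpkg.log

# read the Debian changelog for a suspect package
zless /usr/share/doc/apache2/changelog.Debian.gz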

Robert Gould
For this case, it has not really helped. We did do this initially, and found some performance problems, but rolling back those changes did not solve the problem, unfortunately.
Vegard Larsen
A: 

Seeing an accept() call from your Apache process isn't at all unusual - that's the webserver waiting for a new request.

First of all, you want to establish what the parameters of the load are. Something like

vmstat 1

will show you what your system is up to. Look in the 'swap' and 'io' columns. If you see anything other than '0' in the 'si' and 'so' columns, your system is swapping because of a low memory condition. Consider reducing the number of running Apache children, or throwing more RAM in your server.

If RAM isn't an issue, look at the 'cpu' columns. You're interested in 'us' and 'sy', which show the percentage of CPU time spent in user processes and in the kernel respectively. A high 'us' number points the finger at Apache or your scripts - or potentially something else on the server.

Running

top

will show you which processes are the most active.
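
If the spikes are short-lived, it also helps to capture a few batch-mode snapshots you can look at afterwards (the interval, count and path are just examples):

top -b -d 5 -n 12 -c > /tmp/top-snapshot.log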

Have you ruled out your database? The most common cause of unexpectedly high load I've seen on production LAMP stacks comes down to database queries. You may have deployed new code with an expensive query in it, or reached the point where there are enough rows in your dataset that previously cheap queries become expensive.

During periods of high load, do

echo "show full processlist" | mysql | grep -v Sleep

to see if there are either long-running queries or huge numbers of the same query running at once. Other MySQL tools will help you optimise these.
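
If you haven't already, MySQL 5.1 lets you switch the slow query log on at runtime; roughly (the one-second threshold is an example):

echo "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1;" | mysql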

You may find it useful to configure and use mod_status for Apache, which will allow you to see what request each Apache child is serving and for how long it has been doing so.
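
On Ubuntu that's roughly the following (the access rule and paths are examples; ExtendedStatus adds per-request detail):

a2enmod status

# in the status module config or a vhost, something like:
#   ExtendedStatus On
#   <Location /server-status>
#       SetHandler server-status
#       Order deny,allow
#       Deny from all
#       Allow from 127.0.0.1
#   </Location>

/etc/init.d/apache2 reload

# during a spike:
curl -s http://localhost/server-status?auto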

Finally, get some long-term statistical monitoring set up. Something like Zabbix is straightforward to configure and will let you monitor resource usage over time, so that if things get slow, you have historical baselines to compare against and a better idea of when the problems started.

Jon Topper
The problem is Apache using the CPU. There is more than enough RAM (we ran on 512MB before the upgrade, now we have 2GB). No swapping is happening. The MySQL slow query log reports nothing unusual. We are now seeing the load spike to 40 during heavy use.
Vegard Larsen
mod_status is your best bet from here. Also, to strace all of your Apache processes, rather than just the parent, try: ps aux | grep h[t]tpd | awk '{ print " -p"$2 }' | xargs strace
Jon Topper
+2  A: 

The answer ended up not being Apache-related. As mentioned, we were on a virtual machine. Our user sessions are pretty big (think 500kB per active user), so we had a lot of disk IO. The disk was nearly full, meaning that Ubuntu spent a lot of time moving things around (or so we think). There was no easy way to extend the disk (because it was not set up properly for VMware). This completely killed performance; Apache and MySQL would occasionally use 100% CPU (for a very short time), and the system was so slow to update the CPU usage meters that it seemed to be stuck there.
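
(For anyone hitting the same thing: df and iostat, the latter from the sysstat package, make this sort of problem easy to spot.)

df -h
iostat -x 5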

We ended up setting up a new VM (which also gave us the opportunity to thoroughly document everything on the server). On the new VM we allocated plenty of disk space and moved the sessions into memory (using memcached). The load dropped to 0.2 during off-peak hours and to around 1 near peak (on a 2-CPU VM). Moving the sessions into memcached took a lot of disk IO away (we were constantly using about 2MB/s of disk IO, which is very bad).
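
Pointing PHP sessions at memcached boils down to roughly this, assuming the PECL memcache extension (the server address is an example):

cat > /etc/php5/apache2/conf.d/memcache-sessions.ini <<'EOF'
session.save_handler = memcache
session.save_path = "tcp://127.0.0.1:11211"
EOF
/etc/init.d/apache2 reload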

Conclusion: sometimes you just have to start over... :)

Vegard Larsen