I'm just starting to work on a Tornado application that is having some CPU issues. CPU usage grows monotonically over time until it maxes out at 100%. The system is currently designed not to block the main thread: if it needs to do something that blocks and no asynchronous driver is available, it spawns another thread to do the blocking operation.

Thus we have a main thread that is almost totally CPU-bound and a bunch of other threads that are almost totally IO-bound. From what I've read, this seems to be the perfect way to run into problems with the GIL. Plus, my profiling shows that we're spending a lot of time waiting on signals (which I'm assuming is what __semwait_signal is doing), which, as far as my limited understanding goes, is consistent with the effects the GIL would have.

If I use sys.setcheckinterval to set the check interval to 300, the CPU growth slows down significantly. What I'm trying to determine is whether I should increase the check interval further, leave it at 300, or be wary of raising it at all. After all, I notice that CPU performance gets better, but I'm a bit concerned that this will negatively impact the system's responsiveness.
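For reference, here is a minimal sketch of the kind of call I mean. The helper name raise_gil_interval and the 0.05 second value are just illustrative assumptions, not something from our codebase. On Python 2 the interval is counted in bytecode instructions (default 100); Python 3.2 and later replaced this with sys.setswitchinterval, which takes seconds (default 0.005), so the sketch handles both.

```python
import sys


def raise_gil_interval(check_interval=300, switch_interval=0.05):
    """Ask the interpreter to hand off the GIL less often.

    Returns the value that is in effect after the call.
    Both default values are illustrative, not recommendations.
    """
    if hasattr(sys, "setswitchinterval"):
        # Python 3.2+: the interval is a duration in seconds.
        sys.setswitchinterval(switch_interval)
        return sys.getswitchinterval()
    # Python 2: the interval is a count of bytecode instructions.
    sys.setcheckinterval(check_interval)
    return sys.getcheckinterval()


if __name__ == "__main__":
    print(raise_gil_interval())
```

A larger interval means fewer forced GIL handoffs from the CPU-bound main thread, but also a longer wait before an IO-bound thread that has become runnable gets its turn, which is exactly the responsiveness trade-off I'm worried about.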

Of course, the correct answer is probably that we need to rethink our architecture to take the GIL into account. But that isn't something that can be done immediately. So how do I determine the appropriate course of action to take in the short-term?

+1  A: 

The first thing I would check is that you're exiting threads properly. It's very hard to figure out what's going on with just your description to go on, but you use the word "monotonically," which implies that CPU use is tied to time rather than to load.

You may very well be running into the threading limits of Python, but in that case CPU usage should vary up and down with load (the number of active threads), and the context-switching cost should drop as those threads exit. Is there some reason for a thread, once created, to live forever? If that's the case, prioritize that rearchitecture. Otherwise, the short-term task is to figure out why CPU usage is tied to time and not load. It implies that each new thread has a permanent, irreversible cost in your system, meaning it never exits.
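One rough way to check this is to watch the live thread count over time. The sketch below uses only the standard threading module; the worker function, the thread names, and the Event used to hold the workers open are made up for illustration and stand in for your real blocking calls.

```python
import threading


def live_thread_names():
    """Names of all threads the interpreter currently considers alive."""
    return sorted(t.name for t in threading.enumerate())


release = threading.Event()


def worker():
    # Stand-in for a blocking operation; waits until released.
    release.wait()


before = threading.active_count()

threads = [
    threading.Thread(target=worker, name="blocking-%d" % i)
    for i in range(3)
]
for t in threads:
    t.start()

during = threading.active_count()  # the three workers are still alive here

release.set()
for t in threads:
    t.join()

after = threading.active_count()  # should be back to the starting count

print(before, during, after)
print(live_thread_names())
```

If you log threading.active_count() periodically in the real application (for example from a Tornado periodic callback) and the number only ever climbs, threads are not exiting. If it rises and falls with load while CPU keeps growing, the cause is somewhere else.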

Joshua
I'm positive that threads are exiting properly. Plus, I am seeing some performance difference under load. It's just that the load causes CPU time to grow faster.
Jason Baker