views: 303
answers: 5

I have a system written in Python that processes large amounts of data using plug-ins written by several developers with varying levels of experience.

Basically, the application starts several worker threads, then feeds them data. Each thread determines the plug-in to use for an item and asks it to process the item. A plug-in is just a Python module with a specific function defined. The processing usually involves regular expressions and should not take more than a second or so.

Occasionally, one of the plug-ins will take minutes to complete, pegging the CPU at 100% for the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.
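
To give a flavor of what I mean (this is a made-up pattern, not from any actual plug-in), a nested quantifier can trigger catastrophic backtracking on an input that almost matches:

    import re

    # Nested quantifiers force the SRE engine to try exponentially many
    # ways of splitting the 'a's before it can fail on the trailing 'b'.
    pattern = re.compile(r'(a+)+$')
    pattern.match('a' * 30 + 'b')  # pegs the CPU for minutes, never matches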

This is where things get tricky. If I have a suspicion of who the culprit is, I can examine its code and find the problem. However, sometimes I'm not so lucky.

  • I can't go single threaded. It would probably take weeks to reproduce the problem if I do.
  • Putting a timer on the plugin doesn't help, because when it freezes it takes the GIL with it, and all the other plugins also take minutes to complete.
  • (In case you were wondering, the SRE engine doesn't release the GIL).
  • As far as I can tell, profiling is pretty much useless when multithreading.

Short of rewriting the whole architecture into multiprocessing, any way I can find out who is eating all my CPU?

ADDED: In answer to some of the comments:

  1. Profiling multithreaded code in Python is not useful because the profiler measures total wall-clock function time, not active CPU time. Try cProfile.run('time.sleep(3)') to see what I mean (spelled out in the snippet after this list). (Credit to rog [last comment].)

  2. The reason that going single threaded is tricky is because only 1 item in 20,000 is causing the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while single threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.
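
To spell out point 1 as a runnable demonstration:

    import cProfile
    import time

    # cProfile attributes the full three seconds of wall-clock time to
    # time.sleep, even though the thread used almost no CPU. That's why a
    # GIL-blocked plugin makes every other plugin look slow in the profile.
    cProfile.run('time.sleep(3)')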

That said, it's not a bad idea to try to serialize the specific code that calls the plugins, so that timing of one will not affect the timing of the others. I'll try that and report back.
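
Something like this lock-based wrapper is what I have in mind (the plugin interface here is made up for illustration):

    import threading
    import time

    plugin_lock = threading.Lock()  # serializes all plugin invocations

    def run_plugin(plugin, item):
        # Only one plugin runs at a time, so the measured wall-clock time
        # reflects that plugin alone, not GIL contention from the others.
        with plugin_lock:
            start = time.time()
            result = plugin.process(item)  # stand-in for the plugin's entry point
            elapsed = time.time() - start
        if elapsed > 1.0:
            print('slow: %s took %.1fs on %r' % (plugin.__name__, elapsed, item))
        return result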

A: 

As you said, because of the GIL it is impossible within the same process.

I recommend starting a second monitor process that listens for heartbeats from a thread in your original app. Once the heartbeat has been missing for a specified amount of time, the monitor can kill your app and restart it.
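
A minimal sketch of the idea, using a heartbeat file's modification time (the path and timeout are made up):

    import os
    import time

    HEARTBEAT_FILE = '/tmp/worker.heartbeat'  # touched regularly by the app
    TIMEOUT = 60  # seconds without a heartbeat before the app counts as hung

    while True:
        time.sleep(5)
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
        except OSError:
            continue  # the app hasn't started yet
        if age > TIMEOUT:
            print('no heartbeat for %.0f seconds, restarting the app' % age)
            # kill the app's process and respawn it here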

wr
I'm already doing that, but that's not really a solution
itsadok
A: 

Since you have control over the framework, I would suggest disabling plugins selectively and seeing what happens: if you have plugins P1, P2, ..., Pn, run n processes and disable P1 in the first, P2 in the second, and so on.

This would be much faster than your multithreaded run, since there is no GIL blocking, and you would find out sooner which plugin is the culprit.
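
A rough sketch of launching those runs (the script name and flag are hypothetical; adapt them to however your framework configures plugins):

    import subprocess
    import sys

    plugins = ['P1', 'P2', 'P3']  # your real plugin names here

    # One run per plugin, each with a different plugin disabled. The run
    # that never stalls tells you which plugin is the culprit.
    procs = [
        subprocess.Popen([sys.executable, 'run_pipeline.py',
                          '--disable-plugin', name])
        for name in plugins
    ]
    for p in procs:
        p.wait()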

Anurag Uniyal
+3  A: 

You apparently don't need multithreading, only concurrency, because your threads don't share any state:

Try multiprocessing instead of multithreading.

Single thread / N subprocesses. There you can time each request, since no GIL is held.

Another possibility is to get rid of multiple execution threads and use event-based network programming (i.e. use Twisted).
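
A minimal sketch with multiprocessing.Pool (the plugin dispatch is stubbed out with a trivial regex):

    import multiprocessing
    import re
    import time

    def process_item(item):
        # Stand-in for the real plugin dispatch. Each call runs in its own
        # process, so one pathological regex cannot skew the others' timings.
        start = time.time()
        re.search(r'\d+', item)
        return item, time.time() - start

    if __name__ == '__main__':
        items = ['item-%d' % i for i in range(100)]  # placeholder work items
        pool = multiprocessing.Pool(processes=4)
        for item, elapsed in pool.imap_unordered(process_item, items):
            if elapsed > 1.0:
                print('%s took %.1fs' % (item, elapsed))
        pool.close()
        pool.join()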

makapuf
The other advantage of multiprocessing is that you'll be able to 'see' the process, and get the pid.
monkut
A: 

I'd still look at nosklo's suggestion. You could profile on a single thread to find the item, take a dump during your very long run, and possibly see the culprit. Yeah, I know it's 20,000 items and will take a long time, but sometimes you've just got to suck it up and find the darn thing to convince yourself the problem is caught and taken care of. Run the script, go work on something else constructive, and come back to analyze the results later. That's what separates the men from the boys sometimes ;-)

Alternatively (or additionally), add logging that tracks how long each plugin takes to process each item. Look at the log data at the end of the run and see which one took an awfully long time compared to the others.
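
A small wrapper along those lines (the plugin interface is assumed; under the GIL the wall-clock numbers will be skewed by contention, but the pathological pair should still stand out, especially if plugin calls are serialized as you suggested):

    import logging
    import time

    logging.basicConfig(filename='plugin_times.log', level=logging.INFO)

    def timed_call(plugin, item):
        # Log the wall-clock time for every (plugin, item) pair so the
        # slow combination can be found in the log after the run.
        start = time.time()
        result = plugin.process(item)  # stand-in for the plugin's entry point
        logging.info('%s took %.3fs on %r',
                     plugin.__name__, time.time() - start, item)
        return result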

DoxaLogos
+1  A: 

Have you seen yappi? It supports profiling multithreaded Python applications. See http://code.google.com/p/yappi/.
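
For example (this sketch assumes a recent yappi; the exact API may differ between versions):

    import threading
    import time
    import yappi

    def worker():
        time.sleep(0.5)                     # idle time, ignored by the CPU clock
        sum(i * i for i in range(10 ** 6))  # real CPU work, measured

    yappi.set_clock_type('cpu')  # measure CPU time rather than wall time
    yappi.start()
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    yappi.stop()
    yappi.get_func_stats().print_all()    # per-function CPU time
    yappi.get_thread_stats().print_all()  # per-thread totals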

sumercip
That looked promising, but unfortunately it crashed Python on my program on both Windows and Linux. Doesn't seem like there's much active development on that project, either.
itsadok
Can you send the traceback where it is failing, or the project file if it is a separate project? My mail is sumerc at gmail com.
sumercip
In yappi's setup.py there is a compile parameter, DEBUG_CALL, which will print the functions executed inside the C extension. With that you can get a valid traceback, and I will happily try to fix the problem. We have tested yappi on a live server with multiple threads for weeks, and also on Django and some other simple multithreaded programs, without any problem.
sumercip