I have a system written in Python that processes large amounts of data using plug-ins written by several developers with varying levels of experience.
Basically, the application starts several worker threads, then feeds them data. Each thread determines which plug-in to use for an item and asks it to process the item. A plug-in is just a Python module with a specific function defined. The processing usually involves regular expressions and should not take more than a second or so.
Occasionally, one of the plugins will take minutes to complete, pegging the CPU at 100% the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.
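To illustrate the kind of inefficiency meant here, a minimal sketch follows. The pattern and the input are made up for the demonstration and are not taken from any of the actual plugins; `PATHOLOGICAL` and `time_match` are hypothetical names. Nested quantifiers combined with an input that almost matches force the engine to backtrack through a number of alternatives that grows very quickly with input length.

```python
import re
import time

# Hypothetical pattern: nested quantifiers that backtrack heavily
# when the input is a long run of "a" followed by a non-matching char.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_match(n):
    """Return the seconds spent failing to match n 'a's followed by 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    PATHOLOGICAL.match(text)  # never matches, but tries many splits first
    return time.perf_counter() - start


if __name__ == "__main__":
    # Each few extra characters multiply the running time.
    for n in (10, 14, 18):
        print(f"n={n}: {time_match(n):.4f}s")
```

A handful of extra characters is enough to turn a sub-millisecond match into one that runs for seconds or minutes, which fits the "1 item in 20,000" pattern of failures.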
This is where things get tricky. If I have a suspicion of who the culprit is, I can examine its code and find the problem. However, sometimes I'm not so lucky.
- I can't go single-threaded. It would probably take weeks to reproduce the problem if I did.
- Putting a timer on the plugin doesn't help, because when it freezes it takes the GIL with it, and all the other plugins also take minutes to complete. (In case you were wondering, the SRE engine doesn't release the GIL.)
- As far as I can tell, profiling is pretty useless when multithreading.
Short of rewriting the whole architecture to use multiprocessing, is there any way I can find out which plugin is eating all my CPU?
ADDED: In answer to some of the comments:
Profiling multithreaded code in Python is not useful because the profiler measures the total elapsed (wall-clock) time spent in each function, not the active CPU time. Try cProfile.run('time.sleep(3)') to see what I mean. (credit to rog [last comment]).
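The difference between elapsed time and CPU time can be seen directly with a small sketch. It assumes Python 3.7 or later, where `time.thread_time()` reports the CPU time of the calling thread on platforms that support it; `measure` is a hypothetical helper, not part of any profiler.

```python
import time


def measure(func):
    """Run func() and return (wall_seconds, cpu_seconds) for this thread."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.thread_time()    # CPU time of the calling thread only
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.thread_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Sleeping consumes wall-clock time but almost no CPU time.
    wall, cpu = measure(lambda: time.sleep(0.5))
    print(f"sleep: wall={wall:.2f}s cpu={cpu:.2f}s")
```

A thread that is merely waiting (on the network, or on the GIL) accumulates wall-clock time like the sleeping call here, so a measurement based on elapsed time alone cannot tell it apart from the thread that is actually burning CPU.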
Going single-threaded is tricky because only 1 item in 20,000 causes the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while single-threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.
That said, it's not a bad idea to try to serialize the specific code that calls the plugins, so that timing of one will not affect the timing of the others. I'll try that and report back.
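A rough sketch of what that serialization might look like, under the assumption that every plugin call goes through a single wrapper. The names `call_plugin`, `slow_log`, and `threshold` are hypothetical and not part of the existing system. Only the plugin call itself is wrapped in the lock, so the network-bound parts of each worker can still overlap.

```python
import threading
import time

# One shared lock: at most one plugin runs at any moment, so a slow
# plugin can no longer inflate the measured time of another plugin.
_plugin_lock = threading.Lock()


def call_plugin(plugin_name, process, item, slow_log, threshold=1.0):
    """Call process(item) under the lock and record it if it runs long.

    slow_log is a list that receives (plugin_name, item, seconds) tuples
    for every call that exceeds threshold seconds.
    """
    with _plugin_lock:
        start = time.perf_counter()
        try:
            return process(item)
        finally:
            elapsed = time.perf_counter() - start
            if elapsed > threshold:
                slow_log.append((plugin_name, item, elapsed))


if __name__ == "__main__":
    log = []
    call_plugin("fast", lambda item: item.upper(), "abc", log, threshold=0.05)
    call_plugin("slow", lambda item: time.sleep(0.1), "xyz", log, threshold=0.05)
    for name, item, seconds in log:
        print(f"{name} took {seconds:.2f}s on {item!r}")
```

The timing is still wall-clock, so other threads that hold the GIL outside the lock can add some noise, but a call that runs for minutes should stand out clearly along with the item that triggered it.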