A rather confusing sequence of events happened, according to my log-file, and I am about to put a lot of the blame on the Python logger, which is a bold claim. I thought I should get some second opinions about whether what I am saying could be true.

I am trying to explain why there are several large gaps (around two minutes at a time) in my log file during stressful periods for my application, when it is missing deadlines.

I am using Python's logging module on a remote server, and have set it up, via a configuration file, to email me all logs of severity ERROR or higher. Typically, only one error is sent at a time, but during periods of sustained problems I might get a dozen in a minute - annoying, but nothing that should stress SMTP.
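A minimal sketch of that setup, done in code rather than a config file (the handler details, logger name, and addresses here are placeholders, not my actual configuration):

```python
import logging
import logging.handlers

logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)

# Emails for ERROR and above (mailhost/addresses are placeholders).
smtp = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="app@example.com",
    toaddrs=["me@example.com"],
    subject="Application error",
)
smtp.setLevel(logging.ERROR)
logger.addHandler(smtp)

# Everything (DEBUG and above) also goes to a local file.
file_handler = logging.FileHandler("app.log")
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)
```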

I believe that, after a short spurt of such messages, the Python logging system (or perhaps the SMTP system it sits on) is encountering errors or congestion. The call to Python's logger then BLOCKS for two minutes, causing my thread to miss its deadlines. (I was smart enough to move the logging to after the critical path of the application - so I don't care if logging takes a few seconds, but two minutes is far too long.)

This seems like a rather awkward architecture (both for a logging system that can freeze up, and for an SMTP system (Ubuntu, sendmail) that cannot handle dozens of emails in a minute**), so this surprises me, but it exactly fits the symptoms.

Has anyone had any experience with this? Can anyone describe how to stop it from blocking?

** EDIT #2: I actually counted: 170 emails in two hours. Forget the previous edit; I counted wrong. It's late here...

+1  A: 

A two-minute pause sounds like a timeout - most probably in the networking stack.

Try adding:

*                -       nofile          64000

to your /etc/security/limits.conf file on all of the machines involved and then rebooting all of the machines to ensure it is applied to all running services.
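To check how close you are to the limit before changing anything, something like this (commands assume Linux):

```shell
# Current per-process limit on open file descriptors (sockets count too)
ulimit -n

# System-wide limit on open files
cat /proc/sys/fs/file-max
```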

Benjamin Franz
To make sure I understand: you are suggesting that something (sendmail and/or Python's logging module) is hitting a default security limit on how many files it can open (one per email), and that adding this (presumably larger) value will raise that limit.
Oddthinking
Right. If you are running into the limit on open filehandles (including sockets in the 'TIME_WAIT' state), the OS will prevent you from opening any more until the number falls below the limit. In the case of TCP TIME_WAIT sockets, that happens when the normal timeout (which takes a couple of minutes) expires.
Benjamin Franz
Thanks. I will rig up a stress-test on SMTP, and then test this out during a quiet period for the server - probably in the next few days. (In the meantime, I stopped sending emails, and only log the errors to a file.)
Oddthinking
+1  A: 

Stress-testing was revealing:

My logging configuration sent critical messages to SMTPHandler, and debug messages to a local log file.

For testing, I created a moderately large number of threads (e.g. 50) that waited for a trigger and then simultaneously tried to log either a critical message or a debug message, depending on the test.

Test #1: All threads send critical messages. This revealed that the first critical message took about 0.9 seconds to send, the second around 1.9 seconds, and the third longer still, quickly adding up. It seems that the messages that go to email block, waiting for each other to complete the send.

Test #2: All threads send debug messages. These ran fairly quickly, from hundreds to thousands of microseconds each.

Test #3: A mix of both. It was clear from the results that debug messages were also blocked, waiting for the critical messages' emails to go out.
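A sketch of the stress-test (using a Barrier as the trigger; the NullHandler stands in for my real handlers, and the exact timings will of course vary):

```python
import logging
import threading
import time

logger = logging.getLogger("stress")
logger.addHandler(logging.NullHandler())  # swap in SMTPHandler/FileHandler to reproduce

NUM_THREADS = 50
barrier = threading.Barrier(NUM_THREADS)
durations = []
lock = threading.Lock()

def worker(level):
    barrier.wait()              # all threads fire at once
    start = time.perf_counter()
    logger.log(level, "stress-test message")
    elapsed = time.perf_counter() - start
    with lock:
        durations.append(elapsed)

# For test #1, every thread logs CRITICAL; for #2, DEBUG; for #3, a mix.
threads = [threading.Thread(target=worker, args=(logging.CRITICAL,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"slowest call: {max(durations):.6f}s")
```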

So, it wasn't that the two minutes indicated a timeout; rather, the two minutes represented a large number of threads blocked waiting in the queue.

Why were so many critical messages being sent at once? That's the irony. There was a logging.debug() call inside a method that included a network call. I had some code monitoring the speed of that method (to see if the network call was taking too long); if it was, it (of course) logged a critical error, which sent an email. The next thread then blocked on the logging.debug() call, which meant it missed its deadline, triggering another email, which caused yet another thread to run slowly.

The two-minute delay in one thread wasn't a network timeout. It was one thread waiting for another thread that was blocked for 1 minute 57 seconds - because that one was waiting for another thread blocked for 1 minute 55 seconds, and so on.

This isn't very pretty behaviour from SMTPHandler.

Oddthinking
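For what it's worth, later Python versions (3.2+) provide a standard way to keep a slow handler like SMTPHandler from blocking the calling thread: hand records off through a QueueHandler to a QueueListener that runs the slow handler in its own thread. A sketch (using a StreamHandler writing to a StringIO as a stand-in for SMTPHandler):

```python
import io
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue

# Stand-in for the slow handler; SMTPHandler would go here instead.
stream = io.StringIO()
slow_handler = logging.StreamHandler(stream)

# The listener thread pulls records off the queue and runs the slow handler.
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# This call only enqueues the record, so it returns immediately
# even if the downstream handler is slow.
logger.critical("this call returns immediately")

listener.stop()  # joins the listener thread, flushing pending records
```

With this arrangement, the deadline-sensitive threads only ever pay the cost of a queue put; the SMTP send happens entirely on the listener's thread.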