views:

806

answers:

3

Note: This question was originally asked here but the bounty time expired even though an acceptable answer was not actually found. I am re-asking this question including all details provided in the original question.

A python script is running a set of class functions every 60 seconds using the sched module:

# sc is a sched.scheduler instance
sc.enter(60, 1, self.doChecks, (sc, False))

The script is running as a daemonised process using the code here.

A number of class methods that are called as part of doChecks use the subprocess module to call system functions in order to get system statistics:

ps = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE).communicate()[0]

This runs fine for a period of time before the entire script crashing with the following error:

File "/home/admin/sd-agent/checks.py", line 436, in getProcesses
File "/usr/lib/python2.4/subprocess.py", line 533, in __init__
File "/usr/lib/python2.4/subprocess.py", line 835, in _get_handles
OSError: [Errno 12] Cannot allocate memory

The output of free -m on the server once the script has crashed is:

$ free -m
                  total       used       free     shared     buffers    cached
Mem:                894        345        549          0          0          0
-/+ buffers/cache:  345        549
Swap:                 0          0          0

The server is running CentOS 5.3. I am unable to reproduce on my own CentOS boxes nor with any other user reporting the same problem.

I have tried a number of things to debug this as suggested in the original question:

  1. Logging the output of free -m before and after the Popen call. There is no significant change in memory usage i.e. memory is not gradually being used up as the script runs.

  2. I added close_fds=True to the Popen call but this made no difference - the script still crashed with the same error. Suggested here and here.

  3. I checked the rlimits which showed (-1, -1) on both RLIMIT_DATA and RLIMIT_AS as suggested here.

  4. An article suggested the having no swap space might be the cause but swap is actually available on demand (according to the web host) and this was also suggested as a bogus cause here.

  5. The processes are being closed because that is the behaviour of using .communicate() as backed up by the Python source code and comments here.

The entire checks can be found at on GitHub here with the getProcesses function defined from line 442. This is called by doChecks() starting at line 520.

The script was run with strace with the following output before the crash:

recv(4, "Total Accesses: 516662\nTotal kBy"..., 234, 0) = 234
gettimeofday({1250893252, 887805}, NULL) = 0
write(3, "2009-08-21 17:20:52,887 - checks"..., 91) = 91
gettimeofday({1250893252, 888362}, NULL) = 0
write(3, "2009-08-21 17:20:52,888 - checks"..., 74) = 74
gettimeofday({1250893252, 888897}, NULL) = 0
write(3, "2009-08-21 17:20:52,888 - checks"..., 67) = 67
gettimeofday({1250893252, 889184}, NULL) = 0
write(3, "2009-08-21 17:20:52,889 - checks"..., 81) = 81
close(4)                                = 0
gettimeofday({1250893252, 889591}, NULL) = 0
write(3, "2009-08-21 17:20:52,889 - checks"..., 63) = 63
pipe([4, 5])                            = 0
pipe([6, 7])                            = 0
fcntl64(7, F_GETFD)                     = 0
fcntl64(7, F_SETFD, FD_CLOEXEC)         = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f12708) = -1 ENOMEM (Cannot allocate memory)
write(2, "Traceback (most recent call last"..., 35) = 35
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/agent."..., 52) = 52
open("/home/admin/sd-agent/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/home/admin/sd-agent/dae"..., 60) = 60
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/agent."..., 54) = 54
open("/usr/lib/python2.4/sched.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/sched"..., 55) = 55
fstat64(8, {st_mode=S_IFREG|0644, st_size=4054, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "\"\"\"A generally useful event sche"..., 4096) = 4054
write(2, "    ", 4)                     = 4
write(2, "void = action(*argument)\n", 25) = 25
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/checks"..., 60) = 60
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/checks"..., 64) = 64
open("/usr/lib/python2.4/subprocess.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/subpr"..., 65) = 65
fstat64(8, {st_mode=S_IFREG|0644, st_size=39931, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "# subprocess - Subprocesses with"..., 4096) = 4096
read(8, "lso, the newlines attribute of t"..., 4096) = 4096
read(8, "code < 0:\n        print >>sys.st"..., 4096) = 4096
read(8, "alse does not exist on 2.2.0\ntry"..., 4096) = 4096
read(8, " p2cread\n        # c2pread    <-"..., 4096) = 4096
write(2, "    ", 4)                     = 4
write(2, "errread, errwrite)\n", 19)    = 19
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
open("/usr/lib/python2.4/subprocess.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/subpr"..., 71) = 71
fstat64(8, {st_mode=S_IFREG|0644, st_size=39931, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "# subprocess - Subprocesses with"..., 4096) = 4096
read(8, "lso, the newlines attribute of t"..., 4096) = 4096
read(8, "code < 0:\n        print >>sys.st"..., 4096) = 4096
read(8, "alse does not exist on 2.2.0\ntry"..., 4096) = 4096
read(8, " p2cread\n        # c2pread    <-"..., 4096) = 4096
read(8, "table(self, handle):\n           "..., 4096) = 4096
read(8, "rrno using _sys_errlist (or siml"..., 4096) = 4096
read(8, " p2cwrite = None, None\n         "..., 4096) = 4096
write(2, "    ", 4)                     = 4
write(2, "self.pid = os.fork()\n", 21)  = 21
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
write(2, "OSError", 7)                  = 7
write(2, ": ", 2)                       = 2
write(2, "[Errno 12] Cannot allocate memor"..., 33) = 33
write(2, "\n", 1)                       = 1
unlink("/var/run/sd-agent.pid")         = 0
close(3)                                = 0
munmap(0xb7e0d000, 4096)                = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x589978}, {0xb89a60, [], SA_RESTORER, 0x589978}, 8) = 0
brk(0xa022000)                          = 0xa022000
exit_group(1)                           = ?
+1  A: 

swap may not be the red herring previously suggested. How big is the python process in question just before the ENOMEM?

Under kernel 2.6, /proc/sys/vm/swappiness controls how aggressively the kernel will turn to swap, and overcommit* files how much and how precisely the kernel may apportion memory with a wink and a nod. Like your facebook relationship status, it's complicated.

...but swap is actually available on demand (according to the web host)...

but not according to the output of your free(1) command, which shows no swap space recognized by your server instance. Now, your web host may certainly know much more than I about this topic, but virtual RHEL/CentOS systems I've used have reported swap available to the guest OS.

Adapting Red Hat KB Article 15252:

A Red Hat Enterprise Linux 5 system will run just fine with no swap space at all as long as the sum of anonymous memory and system V shared memory is less than about 3/4 the amount of RAM. .... Systems with 4GB of ram or less [are recommended to have] a minimum of 2GB of swap space.

Compare your /proc/sys/vm settings to a plain CentOS 5.3 installation. Add a swap file. Ratchet down swappiness and see if you live any longer.

pilcrow
What's the best way to check the size of the python process? ps?
DavidM
something like `ps -o user,pid,vsz="Mem(Kb)" -o cmd $PYTHON_PID`, or top(1), should do it.
pilcrow
A: 

munmap(0xb7d28000, 4096) = 0
write(2, "OSError", 7) = 7

I've seen sloppy code that looks like this:

serrno = errno;
some_Syscall(...)
if (serrno != errno)
/* sound alarm: CATROSTOPHIC ERROR !!! */

You should check to see if this is what is happening in the python code. Errno is only valid if the proceeding system call failed.

Edited to add:

You don't say how long this process lives. Possible consumers of memory

  • forked processes
  • unused data structures
  • shared libraries
  • memory mapped files
codeDr
Yes, but we see from the OP's strace that the first syscall failure -- from clone() -- is ENOMEM as is reported. This error is preserved along python's low-memory stumble through traceback construction, even though the C-library `errno` is reset many times along the way.
pilcrow
hmm.... good catch.
codeDr
A: 

The relevant line in your strace is:

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f12708) = -1 ENOMEM (Cannot allocate memory)

... I know others have talked about swap and memory availability (and I would recommend that you set up at least a small swap partition, ironically even if it's on a RAM disk ... the code paths through the Linux kernel when it has even a tiny bit of swap available have been exercised far more extensively than those (exception handling paths) in which there is zero swap available.

However I suspect that this is still a red herring.

The fact that free is reporting 0 (ZERO) memory in use by the cache and buffers is very disturbing. I suspect that the free output ... and possibly your application issue here, are caused by some proprietary kernel module which is interfering with the memory allocation in some way.

According to the man pages for fork()/clone() the fork() system call should return EAGAIN if your call would cause a resource limit violation (RLIMIT_NPROC) ... however, it doesn't say if EAGAIN is to be returned by other RLIMIT* violations. In any event if your target/host has some sort of weird Vormetric or other security settings (or even if your process is running under some weird SELinux policy) then it might be causing this -ENOMEM failure.

It's pretty unlikely to be a normal run-of-the-mill Linux/UNIX issue. You've got something non-standard going on there.

Jim Dennis
The server is running on a Media Template (dv) base which uses Virtuozzo for virtualisation.
DavidM