tags:
views: 79
answers: 2

I have a Perl program based on IO::Async, and it sometimes just exits after a few hours/days without printing any error message whatsoever. There's nothing in dmesg or /var/log either. STDOUT/STDERR both have autoflush(1) set, so data shouldn't be lost in buffers. It doesn't actually return from IO::Async::Loop->loop_forever - a print I put right after that call to make sure never gets triggered.

Now one way would be to keep peppering the program with more and more prints and hope one of them gives me some clue. Is there a better way to find out what was going on in a program when it exits or silently crashes?

+3  A: 

One trick I've used is to run the program under strace or ltrace (or attach to the process using strace). Naturally that was under Linux. Under other operating systems you'd use ktrace or dtrace or whatever is appropriate.

A trick I've used for programs which only exhibit sparse issues over days or weeks, and then only on handfuls among hundreds of systems, is to direct the output from my tracer to a FIFO, and have a custom program keep only the last 10K lines in a ring buffer, with handlers on SIGPIPE and SIGHUP to dump the current buffer contents into a file. (It's a simple program, but I don't have a copy handy and I'm not going to re-write it tonight; my copy was written for internal use and is owned by a former employer.)

The ring buffer allows the program to run indefinitely without fear of running systems out of disk space ... we usually only need a few hundred, or at most a couple thousand, lines of the trace in such matters.

Jim Dennis
It's on Linux. Wouldn't strace/ltrace slow things down horribly? That might be a problem, as it takes hours or days for the crash to happen. I'd be quite willing to try it. What's the easiest way to do such a strace ring buffer?
taw
@taw: strace doesn't appreciably slow down most programs. For one thing, system calls are a relatively small part of the processing overhead for most programs; for another, Perl and most languages usually only use one CPU (per process, of course). On a multi-core or hyperthreaded system (which is almost all of them these days) that leaves plenty of CPU horsepower to handle the strace process. Naturally ltrace is more intensive (it works by interposing itself into the dynamic linkage so it gets hooked into the call path for all normal dynamically dispatched functions). It's still usually okay.
Jim Dennis
@taw: Regarding the ring buffer: mine was written in Python; a similar one could be whipped up in a dozen lines of Perl, or less. You define a list/array with n elements (initialized to "None" in Python or undef in Perl). Then you keep a simple counter of lines as you read them and always replace element counter % n (modulo) in your array with the line you've just read. Element (counter + 1) % n is thus always the oldest line in your ring buffer (or one of the undefs that hasn't yet been cycled over).
Jim Dennis
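A minimal sketch of such a utility along the lines Jim describes (untested; the buffer size and output file name are placeholders):

    #!/usr/bin/perl
    # Keep the most recent $n lines from STDIN in a ring buffer;
    # dump the buffer to $outfile on SIGHUP or SIGPIPE, and again at EOF.
    use strict;
    use warnings;

    my $n       = 10_000;              # placeholder: how many lines to keep
    my $outfile = 'ring-buffer.out';   # placeholder: where to dump on demand

    my @ring    = (undef) x $n;
    my $counter = 0;                   # total lines read so far

    sub dump_ring {
        open my $out, '>', $outfile or return;
        for my $i (0 .. $n - 1) {
            my $line = $ring[ ($counter + $i) % $n ];   # oldest first
            print {$out} $line if defined $line;
        }
        close $out;
    }

    $SIG{HUP}  = \&dump_ring;
    $SIG{PIPE} = \&dump_ring;

    while (my $line = <STDIN>) {
        $ring[ $counter % $n ] = $line;   # overwrite the oldest slot
        $counter++;
    }

    dump_ring();   # tracer exited (program died); keep the tail of the trace

It would be wired up as described in the next comment (mkfifo, point strace -o at the FIFO, run this reader on the other end); you can also send the reader a HUP any time you want a snapshot of the buffer.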
@taw: to use strace or ltrace with a ring buffer: mkfifo $SOMEFILENAME ; strace -o $SOMEFILENAME ... ; $RINGBUFFERUTIL < $SOMEFILENAME > $SOMERESULTFILE (where $SOMERESULTFILE is the name where your ring buffer utility dumps its results on SIGPIPE, SIGHUP, and/or SIGALRM or whatever).
Jim Dennis
Thanks, strace turned out to be helpful, even if I still have no idea why writing to a remote TCP socket can possibly cause SIGPIPE.
taw
Of course it can cause SIGPIPE! SIGPIPE signals that whatever was processing the other end of a file descriptor has gone away. Normal file and device file descriptors should never give SIGPIPE ... but FIFOs, anonymous pipes and sockets have processes (or kernel processing, in the case of sockets) on the other end.
Jim Dennis
Incidentally, setsockopt with SO_NOSIGPIPE might help on your platform -- perhaps only if you use send() in lieu of write(); I don't know the details. And your Perl environment may not expose the options needed to accomplish that.
Jim Dennis
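If SO_NOSIGPIPE isn't available, a different but portable Perl-side approach is to ignore SIGPIPE and check write errors for EPIPE instead; a rough, self-contained sketch:

    use strict;
    use warnings;
    use Errno qw(EPIPE);
    use Socket;

    # Ignore SIGPIPE process-wide: a write to a dead peer then fails with
    # EPIPE (visible in $!) instead of silently killing the process.
    $SIG{PIPE} = 'IGNORE';

    # Demonstrate with a local socketpair whose other end we close at once.
    socketpair(my $reader, my $writer, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    close $reader;

    my $bytes = syswrite($writer, "hello\n");
    if (!defined $bytes && $! == EPIPE) {
        warn "peer gone, write failed with EPIPE -- handle it and carry on\n";
    }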
+1  A: 

If you are capturing STDERR, you could start the program as perl -MCarp::Always foo_prog. Carp::Always forces a stack trace on all errors.
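For illustration, given a hypothetical script like this:

    # foo_prog (hypothetical)
    use strict;
    use warnings;

    sub inner { die "something broke" }
    sub outer { inner() }
    outer();

run normally it only reports the die's own file and line, but run as perl -MCarp::Always foo_prog the same die is printed on STDERR with the full outer() -> inner() call chain, which helps a lot when the failure is buried deep in library code.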

daotoad
I'll try that. One very disturbing thing is that in roughly 1 crash in 100 I got a glibc malloc pool memory corruption error, which indicates that the underlying cause might be a binary bug in one of the many libraries the program uses, and that it will be much harder to find than a pure Perl problem would be.
taw
That sucks. It looks like you are in for a 'fun' time.
daotoad