I am trying to track down a very odd crash. What is so odd about it is a workaround that someone discovered and which I cannot explain.

The workaround is this small program which I'll refer to as 'runner':

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    if (argc == 1)
    {
        fprintf(stderr, "Usage: %s prog [args ...]\n", argv[0]);
        return 1;
    }

    execvp(argv[1], argv + 1);

    fprintf(stderr, "execvp failed: %s\n", strerror(errno));

    // If exec returns because the program is not found or we
    // don't have the appropriate permission
    return 255;
}

As you can see, all this program does is use execvp to replace itself with a different program.

The program crashes when it is directly invoked from the command line:

/path/to/prog args  # this crashes

but works fine when it is indirectly invoked via my runner shim:

/path/to/runner /path/to/prog args   # works successfully

For the life of me, I can't figure out how an extra exec can change the behavior of the program being run (as you can see, runner does not change the environment).

Some background on the crash: it happens in the C++ runtime. Specifically, when the program throws, the crashing version incorrectly concludes that there is no matching catch (although there is) and calls std::terminate. When I invoke the program via runner, the exception is caught properly.

My question: does anyone have an idea why the extra exec changes the behavior of the exec'ed program?

A: 

As a shot in the dark: the double-exec may change the order of environment variables in RAM.

The environment is a memory structure with pointers; the kernel copies that structure into the address space of the new process. The actual order of elements in RAM may change during that copy (environment variables are not semantically ordered, but addresses in RAM have an order). With two exec() calls, the order may be modified twice.

That a change of the ordering of strings in RAM unearths a bug is somewhat freakish, but stranger things have happened.

Thomas Pornin
Thanks for the suggestion, but that does not seem to be it. I dumped the raw environment block and it has the same order in both cases.
R Samuel Klatchko
A: 

It's possible that the .so files loaded by the runner are causing the runee to work correctly. Try running ldd on each of the binaries and check whether any libraries resolve to different versions or locations.

Mark B
The issue is whether `ld-linux.so.2` maps a specific shared object into the address space before or after the main binary (the actual bug is elsewhere, but due to circumstances it only manifests when the SO is mapped at a lower address than the main binary).
R Samuel Klatchko
A: 

I wonder if you're passing something different in argv[0] from what the shell passes. It isn't obvious from what you've written above, but it's possible that you're setting argv[0] to the actual first argument to the program, whereas the shell sets it to the name the program was called by (e.g. the full or short path).

MarkR
@MarkR - thanks for your suggestion. I modified runner so that `argv[0]` would not include the path. Unfortunately, I am still seeing the same behavior.
R Samuel Klatchko
A: 

Perhaps the called program has a memory error. Try running it with valgrind or some other memory-checking tool. Once a memory error has occurred, everything else is undefined behaviour (so anything can happen).

baol
Other than the normal "still reachable" blocks at exit, Valgrind does not detect any errors in the version that terminates (or in the version that doesn't terminate, for that matter).
R Samuel Klatchko
A: 

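Two things you could compare between the 'working' and 'crashing' versions are open file descriptors and signal dispositions, since both are passed on by exec: descriptors stay open unless marked close-on-exec, and ignored signals and the blocked-signal mask are inherited (installed handlers, however, are reset to the default).

I can't see how they would differ or cause the problem, but it might be worth ruling them out.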

Douglas Leeder