Background

  • I have an application with a Poof-Crash[1]. I'm fairly certain it is due to a blown stack.
  • The application is Multi-Threaded.
  • I am compiling with "Enable C++ Exceptions: Yes With SEH Exceptions (/EHa)".
  • I have written an SE Translator function and called _set_se_translator() with it.
  • I have written handler functions and installed them with set_terminate() and set_unexpected().
  • To get the Stack Overflow, I must run in release mode, under heavy load, for several days. Running under a debugger is not an option as the application can't perform fast enough to achieve the runtime necessary to see the issue.
  • I can simulate the issue by adding infinite recursion on execution of one of the functions, and thus test the catching of the EXCEPTION_STACK_OVERFLOW exception.
  • I have WinDBG set up as the crash dump program, and get good information for all other crash issues but not this one. The crash dump will only contain one thread, which is 'Sleep()'ing. All other threads have exited.

The Question

None of the things I've tried has resulted in picking up the EXCEPTION_STACK_OVERFLOW exception.

Does anyone know how to guarantee getting a chance at this exception during runtime in release mode?

Definitions

  1. Poof-Crash: The application crashes by going "poof" and disappearing without a trace.

(Considering the name of this site, I'm kind of surprised this question isn't on here already!)

Notes

  1. An answer was posted briefly about adjusting the stack size to potentially force the issue sooner and allow catching it with a debugger. That is a clever thought, but unfortunately, I don't believe it would help. The issue is likely caused by a corner case leading to infinite recursion. Shortening the stack would not expose the issue any sooner and would likely cause an unrelated crash in validly deep code. Nice idea though, and thanks for posting it, even if you did remove it.
+1  A: 

I remember code from a previous workplace that sounded similar having explicit bounds checks on the stack pointer and throwing an exception manually.

It's been a while since I've touched C++ though, and even when I did touch it I didn't know what I was doing, so caveat implementor about portability/reliability of said advice.
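
For what it's worth, here is a minimal sketch of that idea, not the code from that workplace. It assumes the address of a local variable is a good enough stand-in for the stack pointer, and `StackGuard`, `recurse` and the byte budget are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: records roughly where the stack was when it was
// constructed and throws once a later frame is more than `budget` bytes away.
class StackGuard {
public:
    explicit StackGuard(std::size_t budget_bytes)
        : base_(current_address()), budget_(budget_bytes) {
        assert(budget_bytes > 0);
    }

    // Call at the top of any function suspected of runaway recursion.
    void check() const {
        const std::uintptr_t here = current_address();
        // Stacks may grow up or down, so use the absolute distance.
        const std::uintptr_t used = here > base_ ? here - base_ : base_ - here;
        if (used > budget_) {
            throw std::runtime_error("stack budget exceeded");
        }
    }

private:
    // Approximates the stack pointer with the address of a local variable.
    static std::uintptr_t current_address() {
        volatile char marker = 0;
        return reinterpret_cast<std::uintptr_t>(&marker);
    }

    std::uintptr_t base_;
    std::size_t budget_;
};

// Stand-in for the suspect function: recurses until max_depth or the guard fires.
int recurse(const StackGuard& guard, int depth, int max_depth) {
    guard.check();
    if (depth >= max_depth) return 0;
    volatile char pad[256];  // gives every frame a known minimum size
    pad[0] = static_cast<char>(depth & 0x7f);
    return recurse(guard, depth + 1, max_depth) + pad[0];
}
```

Construct one `StackGuard` near the top of each thread and call `check()` in the recursive path. Because the throw happens while there is still plenty of stack left, an ordinary `catch` can log where it came from.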

nothingmuch
A: 

You can generate debugging symbols without disabling optimizations. In fact, you should be doing that anyway. Optimization just makes debugging harder.

And the documentation for _set_se_translator says that each thread has its own SE translator. Are you setting one for each thread?

set_unexpected is probably a no-op, at least according to the VS 2005 documentation. And each thread also has its own terminate handler, so you should install that per thread as well.
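
A rough sketch of what "install it per thread" could look like, assuming MSVC with /EHa. `SehException`, `translate_seh`, `on_terminate` and `thread_entry` are hypothetical names; the `_MSC_VER` guards are only there so the snippet still compiles with other compilers.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <functional>

#ifdef _MSC_VER
#include <eh.h>  // _set_se_translator (requires /EHa)
#endif

// Hypothetical C++ exception type carrying the SEH exception code.
struct SehException {
    unsigned int code;
};

#ifdef _MSC_VER
// Translator: runs on the faulting thread and rethrows as SehException.
void __cdecl translate_seh(unsigned int code, struct _EXCEPTION_POINTERS*) {
    throw SehException{code};
}
#endif

void on_terminate() {
    std::fputs("terminate handler reached\n", stderr);
    std::abort();
}

// Installs the handlers for the calling thread only, so it has to run on
// every thread that should be covered.
void install_thread_handlers() {
#ifdef _MSC_VER
    _set_se_translator(translate_seh);
#endif
    std::set_terminate(on_terminate);
}

// Hypothetical common entry point: route every thread's work through this.
void thread_entry(const std::function<void()>& body) {
    assert(static_cast<bool>(body));
    install_thread_handlers();
    body();
}
```

The point of `thread_entry` is that the handlers get installed on the new thread itself, not on the thread that created it.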

I would also strongly recommend NOT using SE translation. It takes hardware exceptions that you shouldn't ignore (i.e., you should really log an error and terminate) and turns them into something you can ignore (C++ exceptions). If you want to catch this kind of error, use a __try/__except handler.
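
As an illustration only, a `__try`/`__except` wrapper that reacts to just `EXCEPTION_STACK_OVERFLOW` might look like the sketch below. `blow_stack` and `run_guarded` are hypothetical names, the infinite recursion is deliberate to simulate the fault, and the non-MSVC branch is a stub that exists only so the snippet compiles elsewhere.

```cpp
#include <cassert>
#include <cstdio>

#ifdef _MSC_VER
#include <windows.h>
#include <malloc.h>  // _resetstkoflw

#pragma warning(push)
#pragma warning(disable : 4717)  // recursive on all control paths (intended)
// Deliberately unbounded recursion used to simulate the overflow.
static int blow_stack(int depth) {
    volatile char pad[512];
    pad[0] = static_cast<char>(depth & 0x7f);
    return blow_stack(depth + 1) + pad[0];
}
#pragma warning(pop)

// Returns true if the guarded call overflowed the stack and was caught.
bool run_guarded() {
    __try {
        blow_stack(0);
    } __except (GetExceptionCode() == EXCEPTION_STACK_OVERFLOW
                    ? EXCEPTION_EXECUTE_HANDLER
                    : EXCEPTION_CONTINUE_SEARCH) {
        // The stack has been unwound to this frame, so logging is safe here.
        std::fputs("caught EXCEPTION_STACK_OVERFLOW\n", stderr);
        _resetstkoflw();  // restore the guard page for the next overflow
        return true;
    }
    return false;
}
#else
// Stub: SEH is an MSVC extension, so nothing is caught on other compilers.
bool run_guarded() { return false; }
#endif
```

Every other exception code passes through via `EXCEPTION_CONTINUE_SEARCH`, so nothing is silently swallowed.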

MSN
I do build symbols in release mode, it is not a symbol issue. I'd be happy with an address only callstack I have to decode by hand. I'm getting nothing though. I realize that set_unexpected is a no-op, but wanted to avoid shallow answers that might include it. [continued]
Aaron
__try/__except is not really an issue, due to the magnitude of legacy code. I don't plan to ignore the issue, simply forward them. However, the info on per thread settings is good. I'll explore that avenue and upvote/check if it turns out to be the issue. Thanks for the good info! =D
Aaron
You could also use `SetUnhandledExceptionFilter` or `AddVectoredExceptionHandler` to catch all unhandled or any exceptions in the process, respectively.
MSN
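
A hedged sketch of wiring up both process-wide hooks mentioned in the comment above. `on_any_exception`, `on_unhandled`, `g_last_code` and `install_process_wide_handlers` are hypothetical names, and the non-Windows branch is only a stub so the snippet compiles elsewhere.

```cpp
#include <cassert>
#include <cstdio>

#ifdef _WIN32
#include <windows.h>

// Last exception code seen by the vectored handler, for later inspection.
static volatile LONG g_last_code = 0;

// Vectored handler: sees every exception in the process before any
// frame-based handler does. Keep it tiny, since on a stack overflow there
// is very little stack left to work with.
static LONG CALLBACK on_any_exception(PEXCEPTION_POINTERS info) {
    InterlockedExchange(&g_last_code,
                        static_cast<LONG>(info->ExceptionRecord->ExceptionCode));
    return EXCEPTION_CONTINUE_SEARCH;  // let normal handling continue
}

// Top-level filter: only reached when nothing else handled the exception.
static LONG WINAPI on_unhandled(PEXCEPTION_POINTERS info) {
    std::fprintf(stderr, "unhandled exception 0x%08lx\n",
                 static_cast<unsigned long>(info->ExceptionRecord->ExceptionCode));
    return EXCEPTION_EXECUTE_HANDLER;  // the process terminates after this
}

// Returns true if the vectored handler was registered.
bool install_process_wide_handlers() {
    PVOID handle = AddVectoredExceptionHandler(1 /* call first */, on_any_exception);
    SetUnhandledExceptionFilter(on_unhandled);
    return handle != nullptr;
}
#else
// Stub: these APIs exist only on Windows.
bool install_process_wide_handlers() { return false; }
#endif
```

Unlike the SE translator, both hooks are installed once for the whole process, so thread creation order does not matter.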
I do have a SetUnhandledExceptionFilter set up, which is where I will be sending the unhandled exceptions. While threading *might* be an issue for my configuration, in my testing at least, it does not seem to make a difference. (I'm causing an exception before creating additional threads.)
Aaron
You might also be triggering a pure virtual function call, so you might want to set the purecall handler via `_set_purecall_handler`. And to be extra pedantic, add an exit handler via `atexit`. It's still possible that someone is calling _exit(...) directly.
MSN
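
A small sketch of those two tripwires, assuming the MSVC CRT for the purecall part. `log_exit`, `on_purecall` and `install_exit_tripwires` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Runs when the process leaves through exit() or a return from main().
// It does NOT run for _exit() or TerminateProcess().
static void log_exit() {
    std::fputs("atexit handler reached: process is exiting normally\n", stderr);
}

#ifdef _MSC_VER
// Called by the MSVC CRT when a pure virtual function is invoked.
static void __cdecl on_purecall() {
    std::fputs("pure virtual function call\n", stderr);
    std::abort();
}
#endif

// Returns true if the atexit handler was registered.
bool install_exit_tripwires() {
#ifdef _MSC_VER
    _set_purecall_handler(on_purecall);
#endif
    return std::atexit(log_exit) == 0;
}
```

If the process vanishes and neither message shows up in the log, that points toward `_exit()`, `TerminateProcess()` or something equally abrupt.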
+1  A: 

Have you considered ADPlus from Debugging Tools for Windows?

ADPlus attaches the CDB debugger to a process in "crash" mode and will generate crash dumps for most exceptions the process generates. Basically, you run "ADPlus -crash -p yourPIDhere", it performs an invasive attach and begins logging.

Given your comment above about running under a debugger, I just wanted to add that CDB adds virtually zero overhead in -crash mode on a decent (dual-core, 2GB RAM) machine, so don't let that hold you back from trying it.

kirkus
Can you make your answer a bit more specific with regard to how it will solve the issue stated?
Aaron
ADPlus attaches the CDB debugger to a process in "crash" mode and will generate crash dumps for most exceptions the process generates. Basically, you run "ADPlus -crash -p yourPIDhere", it performs an invasive attach and begins logging.
kirkus
Quick sample from the log file: Stack_buffer_overflow [sbo] return: GN GN / 1st chance: Log;Time;Stack;MiniDump / 2nd chance: Log;Time;Stack;FullDump;EventLog / Starting to attach the debugger to each process / Attaching to 3248 - RECURSIONTEST.EXE
kirkus
I'll give that a shot. Thanks for the input.
Aaron
Good luck. Given your comment above about running under a debugger, I just wanted to add that CDB adds virtually zero overhead in -crash mode on a decent (dual-core, 2GB RAM) machine, so don't let that hold you back from trying it.
kirkus
Good to know. These are Xeon 5150 Quad Core boxes with 8 Gig of RAM, so I should be alright. =D
Aaron
Can you edit your answer to include what you've written in the comments so it's easier to see?
drhorrible
@Kirkus, I wanted to let you know that I ended up using ADPlus to track down this issue, but not in exactly the way you mentioned. I ended up writing a custom config which logged the stack at each thread exit (our thread count is fairly static, so we can get away with that). This allowed us to pinpoint precisely where the problem was occurring and fix it within 24 hours of detection. Before the script, we had spent at least a month trying to figure it out. =D
Aaron
+3  A: 
I'm not so worried about the application crashing, so long as I can figure out where it is going off the rails. I'll look into the vectored exception handler. Good detail. Thanks! =D
Aaron
My problem turns out to not be a Stack Overflow, but this answer answers the question I asked, about catching a Stack Overflow. Thanks for the great info!
Aaron
+2  A: 

I'm not convinced that you're on the right track in diagnosing this as a stack overflow.

But in any case, the fact that you're getting a poof!, plus what you're seeing in WinDbg:

"The crash dump will only contain one thread, which is 'Sleep()'ing. All other threads have exited."

suggests to me that somebody has called the C RTL exit() function, or possibly called the Windows API TerminateProcess() directly. That could have something to do with your interrupt handlers or not. Maybe something in the exception handling logic has a re-entrance check and arbitrarily decides to exit() if it's reentered.

My suggestion is to patch your executables to put an INT 3 breakpoint at the entry point to exit() if it's statically linked. If it's dynamically linked, patch up the import instead, and also patch up any imports of kernel32::TerminateProcess to call DebugBreak().

Of course, exit() and/or TerminateProcess() may be called on a normal shutdown, too, so you'll have to filter out the false alarms, but if you can get the call stack for the case where it's just about to go poof, you should have what you need.

EDIT ADD: Just simply writing your own version of exit() and linking it in instead of the CRTL version might do the trick.

Die in Sente
Thank you, that's very insightful. We are statically linked, so I will attempt the patched executable or linked-in custom exit() you are suggesting. :)
Aaron
You are correct. It turns out to not be an issue of a Stack Overflow. I'm upvoting your answer because it answers the conditions and addresses my mis-analysis of the issue. Thanks for the really useful info and the keen observation of the true issue.
Aaron
Glad I could help. :-)
Die in Sente