Please help! I'm really at my wits' end. My program is a little personal notes manager (google for "cintanotes"). On some computers (and of course I own none of them) it crashes with an unhandled exception just after start. Nothing special can be said about these computers, except that they tend to have AMD CPUs.

Environment: Windows XP, Visual C++ 2005/2008, raw WinApi.

Here is what is certain about this "Heisenbug":

1) The crash happens only in the Release version.

2) The crash goes away as soon as I remove all GDI-related stuff.

3) BoundsChecker has no complaints.

4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?

Any ideas would be greatly appreciated!

UPDATE: I've managed to get the app debugged on a "faulty" PC. The results:

"Unhandled exception at 0x0044a26a in CintaNotes.exe: 0xC000001D: Illegal Instruction."

and code breaks on

0044A26A cvtsi2sd xmm1,dword ptr [esp+14h]

So it seems that the problem was in the "Code Generation/Enable Enhanced Instruction Set" compiler option. It was set to "/arch:SSE2" and was crashing on the machines that didn't support SSE2. I've set this option to "Not Set" and the bug is gone. Phew!
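A startup check along these lines could turn this kind of crash into a readable error message. It is only a sketch: the names `edxReportsSse2` and `cpuHasSse2` are made up for illustration, and it assumes an x86 build with either MSVC (`__cpuid` from `<intrin.h>`) or GCC/Clang (`__get_cpuid` from `<cpuid.h>`). Code built with /arch:SSE2 may use SSE2 instructions anywhere, so a check like this is only dependable if it sits in code compiled without that option.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (MSVC)
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// CPUID leaf 1 reports SSE2 support in bit 26 of EDX.
const unsigned kSse2Bit = 1u << 26;

// Pure helper (hypothetical name): decodes the EDX feature word.
bool edxReportsSse2(unsigned edx) {
    return (edx & kSse2Bit) != 0;
}

// Hypothetical helper: asks the CPU whether SSE2 is available.
// Returns false on builds where this sketch has no way to query CPUID.
bool cpuHasSse2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return edxReportsSse2(static_cast<unsigned>(regs[3]));
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;                    // leaf 1 not supported
    }
    return edxReportsSse2(edx);
#else
    return false;                        // not an x86 build
#endif
}
```

The application could call `cpuHasSse2()` first thing at startup and show a message instead of continuing. On Windows, `IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE)` is an alternative way to ask the same question.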

Thank you all very much for help!!

+1  A: 

Most heisenbugs / release-only bugs are due to either flow of control that depends on reads from uninitialised memory / stale pointers / past end of buffers, or race conditions, or both.

Try overriding your allocators so they zero out memory when allocating. Does the problem go away (or become more reproducible?)
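One way to try this in C++ is to replace the global `operator new` so every allocation comes back zero-filled. This is a minimal sketch, assuming a C++11 compiler (older Visual C++ versions would need `throw()` in place of `noexcept` and `NULL` in place of `nullptr`); it does not cover `malloc`, `HeapAlloc`, or over-aligned allocations.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Debugging aid: every allocation made through operator new is zero-filled.
// The default operator new[] forwards to this function, so arrays are covered too.
void* operator new(std::size_t size) {
    if (size == 0) {
        size = 1;                        // calloc(1, 0) is allowed to return null
    }
    void* block = std::calloc(1, size);  // calloc returns zeroed memory
    if (block == nullptr) {
        throw std::bad_alloc();
    }
    return block;
}

// Matching deallocation functions: memory from calloc must go back to free.
void operator delete(void* block) noexcept {
    std::free(block);
}

void operator delete(void* block, std::size_t) noexcept {
    std::free(block);
}
```

If the crash disappears or changes with this in place, that points toward a read of uninitialised heap memory. If nothing changes, the heap is a less likely suspect.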

Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?

Stack overflow! ;)

moonshadow
Thanks for the idea, I'll definitely investigate in this direction
Alex Jenter
+5  A: 

So it doesn't crash in the Debug configuration? Many things differ from the Release configuration: 1) initialization of globals, 2) the actual machine code generated, etc.

So the first step is to find out the exact settings for each parameter in Release mode as compared to Debug mode.

-AD

goldenmean
Sounds like a good idea to me, I'll try that too
Alex Jenter
+4  A: 

1) The crash happens only in the Release version.

That's usually a sign that you're relying on some behaviour that's not guaranteed, but happens to be true in the debug build. For example, if you forget to initialize your variables, or access an array out of bounds. Make sure you've turned on all the compiler checks (/RTCsuc). Also check things like relying on the order of evaluation of function parameters (which isn't guaranteed).
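The order-of-evaluation point can be illustrated with a small sketch. All names here (`nextId`, `joinIds`, and the two wrapper functions) are hypothetical and assume a C++11 compiler for `std::to_string`.

```cpp
#include <string>

// Hypothetical helper with a side effect: returns the current id, then advances it.
int nextId(int& counter) {
    return counter++;
}

std::string joinIds(int first, int second) {
    return std::to_string(first) + "," + std::to_string(second);
}

// Risky: the two nextId calls may run in either order, so the result can be
// "0,1" with one compiler or build setting and "1,0" with another.
std::string idsUnspecifiedOrder() {
    int counter = 0;
    return joinIds(nextId(counter), nextId(counter));
}

// Safer: named locals force the calls to happen in a fixed order.
std::string idsFixedOrder() {
    int counter = 0;
    const int first = nextId(counter);
    const int second = nextId(counter);
    return joinIds(first, second);
}
```

Code like `idsUnspecifiedOrder` can behave one way in a debug build and another in an optimized build without either compiler being wrong.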

2) The crash goes away as soon as I remove all GDI-related stuff.

Maybe that's a hint that you're doing something wrong with the GDI related stuff? Are you using HANDLEs after they've been freed, for example?

Anthony Williams
I indeed had one problem with the HFONT handles, but got rid of it as soon as BoundsChecker pointed it out to me. But unfortunately the bug was unaffected by this change.
Alex Jenter
+1  A: 

Sounds like stack corruption to me. My favorite tool to track those down is IDA Pro. Of course, you don't have that kind of access to the user's machine.

Some memory checkers have a hard time catching stack corruption (if it is indeed that). The surest way to get those, I think, is runtime analysis.

This can also be due to corruption in an exception path, even if the exception was handled. Do you debug with 'catch first-chance exceptions' turned on? You should as long as you can. It does get annoying after a while in many cases.

Can you send those users a checked version of your application? Check out Minidump. Handle that exception and write out a dump. Then use WinDbg to debug on your end.

Another method is writing very detailed logs. Create a "Log every single action" option, and ask the user to turn that on and send it to you. Dump out memory to the logs. Check out '_CrtDbgReport()' on MSDN.

Good Luck!

EDIT:

Responding to your comment: An error on a local variable declaration is not surprising to me. I've seen this a lot. It's usually due to a corrupted stack.

Some variable on the stack may be running over its boundaries, for example. All hell breaks loose after that. Then stack variable declarations throw random memory errors, virtual tables get corrupted, etc.

Anytime I've seen those for a prolonged period of time, I've had to go to IDA Pro. Detailed runtime disassembly debugging is the only thing I know that really gets those reliably.

Many developers use WinDbg for this kind of analysis. That's why I also suggested Minidump.

kervin
Thanks for all the ideas. I've already written a log, and it pointed to an int variable declaration. I'm not joking, the code was like this: log << " before"; log.flush(); int i; log << " after"; log.flush(); and only "before" was in the log file.
Alex Jenter
+9  A: 

4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?

What is the underlying code in the executable / assembly? Declaration of int is no code at all, and as such cannot crash. Do you initialize the int somehow?

To see the code where the crash happened you should perform what is called a postmortem analysis.

Windows Error Reporting

If you want to analyse the crash, you should get a crash dump. One option for this is to register for Windows Error Reporting - requires some money (you need a digital code signing ID) and some form filling. For more visit https://winqual.microsoft.com/ .

Get the crash dump intended for WER directly from the customer

Another option is to get in touch with some user who is experiencing the crash and get a crash dump intended for WER from him directly. The user can do this when he clicks on the Technical details before sending the crash to Microsoft - the crash dump file location can be checked there.

Your own minidump

Another option is to register your own exception handler, handle the exception and write a minidump anywhere you wish. A detailed description can be found in the Code Project article "Post-Mortem Debugging Your Application with Minidumps and Visual Studio .NET".
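A rough sketch of the "own minidump" option, using `SetUnhandledExceptionFilter` and `MiniDumpWriteDump` from dbghelp. The function names and the `crash.dmp` file name are illustrative assumptions, not taken from the article; on non-Windows builds the installer is a stub that reports that nothing was installed.

```cpp
#if defined(_WIN32)
#include <windows.h>
#include <dbghelp.h>
#if defined(_MSC_VER)
#pragma comment(lib, "dbghelp.lib")   // link against dbghelp (MSVC only)
#endif

// Called by Windows when an exception goes unhandled; writes a small dump file.
static LONG WINAPI writeMiniDump(EXCEPTION_POINTERS* info) {
    // "crash.dmp" is a placeholder; a real application would pick a writable path.
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION exceptionInfo;
        exceptionInfo.ThreadId = GetCurrentThreadId();
        exceptionInfo.ExceptionPointers = info;
        exceptionInfo.ClientPointers = FALSE;

        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &exceptionInfo, NULL, NULL);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate afterwards
}

// Hypothetical helper: installs the filter; returns true when it was installed.
bool installCrashDumpHandler() {
    SetUnhandledExceptionFilter(writeMiniDump);
    return true;
}
#else
// Stub for non-Windows builds: minidumps are a Windows facility.
bool installCrashDumpHandler() {
    return false;
}
#endif
```

Calling `installCrashDumpHandler()` at the top of `WinMain` would be the usual place. The resulting dump can then be opened in Visual Studio or WinDbg together with the PDB files of that exact build.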

Suma
Make sure that you are building debug info (PDB files) for your application in RELEASE (as well as debug) modes. Make sure you keep the set of PDBs for each released version so you can use them with the dump. Maybe use a local symbol server. Vote Suma's answer up - it's the right one!
Aardvark
Thanks, I'll try the last idea with the minidump. Unfortunately I'm not used to the low-level debugging, so I'll need to read more on this...
Alex Jenter
+1  A: 

4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?

I've found the cause of numerous "strange crashes" to be dereferencing a broken this inside a member function of said object.

Johann Gerell
Could you elaborate, what exactly is a "broken this"?
Alex Jenter
Alex, broken `this` is like... this: string* ps = new string; delete ps; ps->clear(); When you step inside clear() you will see broken `this`.
Constantin
+1  A: 

Try Rational (IBM) PurifyPlus. It catches a lot of errors BoundsChecker doesn't.

shoosh
Thanks for the idea. How do I get it to run in the Demo mode? It asks for a License server.
Alex Jenter
+1  A: 

What does the crash say? Access violation? Exception? That would be a further clue for solving this.

Ensure you have no preceding memory corruptions using PageHeap.exe

Ensure you have no stack overflow (CBig array[1000000])

Ensure that you have no un-initialized memory.
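For the stack overflow item, a minimal sketch of moving a large local buffer to the heap. `CBig` here is a hypothetical 1 KB stand-in for whatever type is being stored, and `processBigBuffer` is an illustrative name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a large element type.
struct CBig {
    char payload[1024];
};

// A local 'CBig array[1000000];' would need roughly 1 GB of stack, far more
// than the default stack reserve of a Windows thread (typically 1 MB).
// std::vector keeps the elements on the heap instead.
std::size_t processBigBuffer(std::size_t count) {
    std::vector<CBig> items(count);      // heap storage, value-initialized
    if (!items.empty()) {
        items.back().payload[0] = 1;     // touch the last element
    }
    return items.size() * sizeof(CBig);  // total bytes held by the buffer
}
```

Large fixed-size arrays inside frequently called or recursive functions are the usual suspects when checking for this.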

Further you can run the release version also inside the debugger, once you generate debug symbols (not the same as creating debug version) for the process. Step through and see if you are getting any warnings in the debugger trace window.

+2  A: 

Download the Debugging Tools for Windows package. Set the symbol paths correctly, then run your application under WinDbg. At some point, it will break with an Access Violation. Then you should run the command "!analyze -v", which is quite smart and should give you a hint on what's going wrong.

Do I need to do this locally on a problem machine?
Alex Jenter
+2  A: 

"4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?"

This could be a sign that the hardware is in fact faulty or being pushed too hard. Find out if they've overclocked their computer.

Mike Dimmick
I think that's not the case. It happened on many PCs that were not overclocked.
Alex Jenter
+1  A: 

When I get this type of thing, I try running the code through Gimpel's PC-lint (static code analysis), as it checks for different classes of errors than BoundsChecker. If you are using BoundsChecker, turn on the memory poisoning options.

You mention AMD CPUs. Have you investigated whether there is a similar graphics card / driver version and / or configuration in place on the machines that crash? Does it always crash on these machines or just occasionally? Maybe run the System Information tool on these machines and see what they have in common.

Shane MacLaughlin