How to hunt a Heisenbug

+6 Q:

Recently, we received a bug report from one of our users: something on the screen was displayed incorrectly in our software. Somehow, we could not reproduce this in our development environment (Delphi 2007).

After some further study, it appears that this bug only manifests itself when "Code optimization" is turned on.

Are there any people here with experience in hunting down such a Heisenbug? Any specific constructs or coding bugs that commonly cause such an issue in Delphi software? Any places you would start looking?

I'll also just start debugging the whole thing in the usual way, but any tips specific to Optimization-related bugs (*) would be more than welcome!

(*) Note: I don't mean to say that the bug is caused by the optimizer; I think it's much more likely some wonky construct in the code is somehow pushed "over the edge" by the optimizer.

Update

It seems the bug boils down to a record being fully initialized with zeros when there's no code optimization, and the same record containing some random data when there is optimization. In this case, the random data seems to cause an enum type to contain invalid data (to my great surprise!).

Solution

The solution turned out to involve an uninitialized local record variable somewhere deep in the code. Apparently, without optimization the record was reset (heap?), and with optimization turned on, the record was filled with the usual garbage. Thanks to you all for your contributions; I learned a lot along the way!

+11 A:

Typically bugs of this form are caused by invalid memory access (reading uninitialised data, reading off the end of a buffer...) or thread race conditions.

The former will be affected by optimisations causing data layout to be rearranged in memory, and/or possibly by debug code that initialises newly allocated memory to some value, causing the incorrect code to "accidentally work".

The latter will be affected due to timings changing between optimisation levels. The former is generally much more likely.

If you have some automated way of making freshly allocated memory be filled with some constant value before it is passed to the program, and this makes the crash go away or become reproducible in the debug build, that'll provide a good point to start chasing things.

moonshadow 2009-10-15 18:16:18

Hm, yeah, I think it's the uninitialized memory then (it's a single-threaded app). I'll keep an eye on that while debugging the code.

onnodb 2009-10-15 18:46:26

I've seen this kind of bug with uninitialized variables or memory rearrangement a lot with C/C++ over the years. I've also seen many kinds of timing bugs affected by optimization levels, not just thread issues. For instance, I've seen code that wrote to a serial port or network connection, did something else for a while, and then tried to read a response without allowing for the possibility the response wouldn't be there yet. It worked fine when the code wasn't optimized, and failed when the code was optimized.

Bob Murphy 2009-10-15 19:15:05

+2 A:

Especially in purely native languages, like Delphi, you should be more than careful not to abuse the freedom to cast anything to anything. IOW: One thing I have seen is that someone copies the definition of a class (e.g. from the implementation section in RTL or VCL) into his own code and then casts instances of the original class to his copy. Now, after upgrading the library where the original class came from, you might experience all kinds of weird stuff, like jumping into the wrong methods or buffer overflows.

There's also the habit of using signed integers as pointers and vice versa (instead of Cardinal). This works perfectly fine as long as your process has only 2 GB of address space. But boot with the /3GB switch and you will see a lot of apps start acting crazy: those made the assumption "pointer = signed integer" at least somewhere. Your customer uses 64-bit Windows? Chances are he has a larger address space for 32-bit apps. Pretty tough to debug without having such a test system available.
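To make the signed-pointer problem concrete, here is a minimal sketch. It assumes a 32-bit target such as Delphi 2007, where SizeOf(Pointer) = 4; the function names and the two addresses are made up for illustration and are never dereferenced.

```pascal
program PointerCastDemo;

{$APPTYPE CONSOLE}

// Assumption: 32-bit target, so Pointer, Integer and Cardinal are all 4 bytes.

// Compares two addresses after casting them to a SIGNED integer.
// Any address at or above $80000000 turns negative, so the ordering breaks.
function BelowSigned(A, B: Pointer): Boolean;
begin
  Result := Integer(A) < Integer(B);
end;

// Compares the same addresses as UNSIGNED values, which keeps the ordering.
function BelowUnsigned(A, B: Pointer): Boolean;
begin
  Result := Cardinal(A) < Cardinal(B);
end;

var
  LowAddr, HighAddr: Pointer;
begin
  // Hypothetical addresses: one below and one above the 2 GB boundary.
  LowAddr := Pointer($70000000);
  HighAddr := Pointer($90000000);

  WriteLn('signed compare:   ', BelowSigned(LowAddr, HighAddr));
  WriteLn('unsigned compare: ', BelowUnsigned(LowAddr, HighAddr));
end.
```

On such a 32-bit build the signed comparison claims that $70000000 is not below $90000000, because the second value is read as a negative number. Code like this never misbehaves while every address stays under 2 GB.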

Then, there's race conditions. Like having 2 threads, where one is very, very slow. So that you instinctively assume it will always be the last one and so there's no code that handles the scenario where "Captn slow" finishes first. Changes in the underlying technologies can make these assumptions very wrong, very fast indeed. Take a look at the upcoming breed of Flash-based super-mega-fast server storage. Systems that can read and write Gigabytes per second. Applications that assume the IO stuff to be significantly slower than some calculations on in-memory values will easily fail on this kind of fast storage.

I could go on and on, but I gotta run right now... Cheers

Robert Giesecke 2009-10-15 18:27:35

None of those issues could be the cause of this bug (I'm pretty sure the code base doesn't do any of that funky type-casting; all the customer's systems are 32-bit; it's a single-threaded app), but I have learned a few things from your reply, thanks!

onnodb 2009-10-15 18:45:33

+1 A:

Code optimization does not necessarily mean that debug symbols have to be left out. Do a debug build with code optimization; then you can still debug the program, and maybe the error will occur now.

Ozan 2009-10-15 18:34:10

Good point, I forgot to mention that. Still, I was wondering what types of code could cause such a bug. It's a big program, so any hints as to what the cause could be are welcome :o)

onnodb 2009-10-15 18:41:37

+5 A:

Could very well be a memory vs. register issue: your program runs fine while relying on memory persisting after a free.
I would recommend running your application with FastMM4 in full debug mode to be sure of your memory management.
Another (not free) tool which can be very useful in a case like this is EurekaLog.

Another thing that I've seen: a crash with the FPU registers being botched when calling some outside code (DLL, COM...) while with the debugger everything was OK.

François 2009-10-15 18:59:52

For such problems I always advise using log files.

Question: Can you somehow detect the incorrect display in the source code?

If not, my answer won't help you.

If yes, check for the incorrectness, and as soon as you find it, dump the stack to a log file. (See post-mortem debugging for details about dumping and resymbolizing the stack.)

If you see that some data has been corrupted, but you don't know how and when this happened, extract a function that tests the data for validity (logging when the test fails), and call this function from more and more places during program execution (e.g. after each menu call). If you repeat this approach a few times, you have a good chance of finding the problem.
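A minimal sketch of such a validity check in Delphi might look like this. The record type, the enum, the log file name and the call sites are all hypothetical; only the pattern (check, log the location, call from many places) is the point.

```pascal
program StateCheckDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical types standing in for the real display data.
  TDisplayMode = (dmNormal, dmHighlighted, dmHidden);

  TCellInfo = record
    Mode: TDisplayMode;
    Width: Integer;
  end;

const
  LogFileName = 'statecheck.log'; // hypothetical log file name

// Appends one timestamped line to the log file, creating it if needed.
procedure LogLine(const Msg: string);
var
  F: TextFile;
begin
  AssignFile(F, LogFileName);
  if FileExists(LogFileName) then
    Append(F)
  else
    Rewrite(F);
  try
    WriteLn(F, FormatDateTime('yyyy-mm-dd hh:nn:ss', Now), ' ', Msg);
  finally
    CloseFile(F);
  end;
end;

// Returns True if the cell looks sane; otherwise logs where it was caught.
function CheckCell(const Info: TCellInfo; const Where: string): Boolean;
var
  RawMode: Integer;
begin
  // Copy the ordinal into a plain Integer before range-checking it.
  RawMode := Ord(Info.Mode);
  Result := (RawMode >= 0) and (RawMode <= Ord(High(TDisplayMode)))
    and (Info.Width >= 0);
  if not Result then
    LogLine(Format('%s: bad cell (mode=%d, width=%d)',
      [Where, RawMode, Info.Width]));
end;

var
  Cell: TCellInfo;
begin
  FillChar(Cell, SizeOf(Cell), 0);
  CheckCell(Cell, 'after init');                 // passes, logs nothing
  Cell.Width := -1;                              // simulate corruption
  CheckCell(Cell, 'after simulated corruption'); // fails, writes a log line
end.
```

Sprinkling calls to CheckCell with different Where labels narrows down the window in which the data goes bad. The first label that shows up in the log marks the earliest point where the corruption was visible.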

RED SOFT ADAIR 2009-10-15 19:09:40

+1 A:

One easy thing to do is to turn on compiler warnings and hints, rebuild the project, and then fix all warnings and hints.

Cheers

APZ28 2009-10-15 19:29:08

Compiler warnings and hints are always turned on in our project, but yes, that's generally good advice!

onnodb 2009-10-16 06:22:26

+1 A:

Is this a local variable inside a procedure or function?

If so, then it lives on the stack, and will contain garbage. Depending on the execution path and compiler settings the garbage will change, potentially pushing your logic 'over the edge'.

--jeroen

Jeroen Pluimers 2009-10-15 19:44:52

+2 A:

If it is Delphi business code, with data-aware components etc., the following might not apply.

I'm however writing machine vision code which is a bit computational. Most of the unittests are console based. I also am involved with FPC, and over the years have tested a lot with FPC. Partially out of hobby, partially in desperate situations where I wanted any hunch.

Some standard tricks that I tried (in order of decreasing usefulness):

1. Use -gv and valgrind the code (practically this means the application has to run on Linux/FreeBSD, but for computational code and unit tests that can be doable).
2. Compile using the FPC parameter -gt (= trash local vars: randomize local variables on procedure entry).
3. Modify the heap manager to randomize the data of the blocks it hands out (also applicable to Delphi code).
4. Try FPC's range/overflow checking and compiler hints.
5. Run on a Mac Mini (PowerPC) or Win64. Due to totally different rules and memory layouts it can catch pretty funky things.

Items 2 and 3 together allow you to find most, if not all, initialization problems.

Try to find any clues, and then go back to Delphi and search in a more focused way, debug, etc.

I do realize this is not easy. I have a lot of FPC experience, and didn't have to find everything out from scratch for these cases. Still it might be worth a try, and might be a motivation to start making your non-visual systems and unit tests FPC-compatible and platform-independent. Most of this work will be needed anyway, seeing the Delphi roadmap.

Marco van de Voort 2009-10-15 20:49:19

Very interesting reply, thanks!

onnodb 2009-10-16 06:23:39

Given your description of the problem I think you had uninitialized data that you got away with without the optimizer but which blew up with the optimization on.

Loren Pechtel 2009-10-15 21:19:13

+3 A:

A record that contains different data according to different compiler settings tells me one thing: That the record is not being explicitly initialised.

You may find that the setting of the compiler optimization flag is only one factor that might affect the content of that record - with any uninitialised data structures the one thing that you can rely on is that you can't rely on the initial content of the structure.

In simple terms:

class member data is initialised (to zeros) for new instances of the class
local variables (in functions and procedures) are NOT initialised except in a few specific cases: interface references, dynamic arrays and strings and I think (but would need to check) records if they contain one or more fields of those types that would be initialised (strings, interface references etc). Unit-level (global) variables, by contrast, are zero-initialised.

The question as stated is now a little misleading because it seems you found your "Heisenbug" easily enough. Now the issue is how to deal with it, and the answer is simply to explicitly initialise your record so that you aren't reliant on whatever behaviour or side-effect of the compiler is sometimes taking care of that for you and sometimes not.
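As a rough illustration of explicit initialisation, the sketch below zeroes a local record with FillChar before use and checks its enum field. The record, the enum and the routine names are invented for this example and are not taken from the asker's code.

```pascal
program InitRecordDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical types, for illustration only.
  TDisplayMode = (dmNormal, dmHighlighted, dmHidden);

  TCellInfo = record
    Mode: TDisplayMode;
    Width: Integer;
  end;

// Debugging aid: reports whether the ordinal is one of the declared values.
// The compiler is allowed to assume an enum is always in range, so treat
// this as a diagnostic, not as a guarantee.
function IsValidMode(Mode: TDisplayMode): Boolean;
var
  Raw: Integer;
begin
  Raw := Ord(Mode);
  Result := (Raw >= 0) and (Raw <= Ord(High(TDisplayMode)));
end;

procedure ShowCell;
var
  Info: TCellInfo; // local record: lives on the stack and is NOT zeroed for us
begin
  // Explicitly zero every field, so Mode starts out as dmNormal (ordinal 0)
  // regardless of optimization settings or previous stack contents.
  FillChar(Info, SizeOf(Info), 0);
  Info.Width := 80;

  if IsValidMode(Info.Mode) then
    WriteLn('Mode ordinal: ', Ord(Info.Mode), ', width: ', Info.Width)
  else
    WriteLn('Invalid mode ordinal: ', Ord(Info.Mode));
end;

begin
  ShowCell;
end.
```

FillChar is fine here because the record holds only plain value fields. For records with strings, interfaces or dynamic arrays, assigning each field explicitly is the safer habit, since overwriting a live managed field with zeros bypasses its reference counting.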

Deltics 2009-10-16 05:18:37
