How do I track down a Heisenbug in some Python code?

views:

156

answers:

+6 Q:

How do I track down a Heisenbug in some Python code?

Quick background: we have a large source base written in Python. It is a compiler for a domain specific language, and internally everything is represented as directed graphs. These digraphs are built up from sets, and so we use the builtin set type in Python.

The problem is that we didn't originally realise that Python actively uses the lack of ordering guarantee in a set object to use a faster non-deterministic implementation. So when you iterate over the objects in a set (as we do frequently) the order they are returned in is weakly random. It doesn't change on every execution, but it does change frequently. From what I've seen debugging our code it seems as if a hash of the source code acts as a seed for the random number generator. So changing the code, even in a path that is not executed, causes the set iterator to change the order that elements are generated in.

Our basic debugging strategy is to dump prints into where we think the error is, refining them based on the output until we find the bug. Not very elegant but for most things it largely works. The basic problems is that adding/changing any print statement triggers different behaviour and a wildly different execution path. We can't afford to print/log everything as it slows down the compiler's execution time from about 20s (managable) to about an hour (not so managable).

How would you diagnose a problem that only occurs infrequently and disappears when you start to look for it?

Edit for clarification: Several answers suggest ways to fix the ordering of the sets. As torkildr says below "The problem here isn't that sets behave strangely, the problem is that your program behaves as if it doesn't". This is exactly the problem, but the solution is not to use deterministic sets. This would simply mask the behaviour. The problem is to find why our program behaves this way and fix that behaviour. The algorithms that we use should work on graphs represented as unordered sets. They don't. I need to find out where these bugs and why they occur.

Problem solved: It turns out that on the implementation of Python that I'm using (2.6 on OS-X) if the relationship between the __eq__ and __hash__ methods of the objects being stored in the set is not quite a valid ordering then the system exhibits the weakly random behaviour described. There must be some code in the C implementation of set.add() that uses something from the random module to build the representation. This causes a dependency on the system entropy pool which changes the ordering on disk writes.

No direct answers, but reading kriss' follow-up question caused the insight to solve this problem so he gets the vote.

+2 A:

I don't really see why you would want to look at the problem further when you obviously have an idea what the root of the problem is.

The main issue her is: You need deterministic, ordered graphs; Python sets doesn't give you this. Why wouldn't you change implementation to some list/set implementation that does?

When you say it changes whenever you change the code, this doesn't sound at all surprising, when you consider that the set algorithm probably is optimized to use the fastest possible storage pattern it see fit. In other words, you do some other operation, store a variable, print a line, and the memory changes.

edit: To sum it up; The problem here isn't that sets behave strangely, the problem is that your program behaves as if it doesn't.

torkildr 2010-09-28 10:37:44

..also, I'm suspecting you posted this, only so you could write Heisenbug ;)

torkildr 2010-09-28 10:38:21

Sounds to me like their code *shouldn't* have any dependencies on the order, but, due to a bug, does. However your suggestion is probably the best way to investigate/solve the problem.

Douglas Leeder 2010-09-28 10:42:41

Because we need the performance that sets give us, and if we write our own deterministic sets then it will a) be too slow, and b) take up a lot of development effort. Also, as Douglas points out above we need to fix the bug anyway.

Amoss 2010-09-28 11:30:26

+2 A:

Why not just change nothing in the code that affects set output ordering and use pdb instead of adding prints ? Does setting a breakpoint also change set ordering ? If not, pdb will allow you to inspect internal variables.

Your description of your problem also lead to some mysteries. How do you detect there is a bug ? If this detection can be done at run time a possible strategy would be to run pdb from your code (as simple as import pdb; pdb.set_trace()) as soon as you see it (using run time ifs), not in a subsequent execution after changing the code. This way you wouldn't have to change the code at all to debug (but maybe to remove those assertions later when the code will be debugged and rock strong).

By the way you should also unit test all your code when writing it, then it'll be much less likely to hide subtile bugs.

kriss 2010-09-28 10:58:12

I've had a quick look at the pdb docs and I don't see how to do this. The docs imply that you alter the code to insert a breakpoint by calling pdb.... have I misread, or just not understood?

Amoss 2010-09-28 11:33:58

You have read fine, but it seems in your case you'll just have to call pdb from another module (or even from python interactive) then import and run your module. What is unclear is if set behavior depends on source code or on memory state. By the way you could also use dictionnaries instead of set. For them the order is guaranteed non random (nd I'm quite surprised there is anything random in sets, but python documentation indeed does not say a thing on that and seems not to guarantee the order will be stable for two consecutive iterations on unaltered sets).

kriss 2010-09-28 11:45:30

If you have an executable module, you can use `python -m pdb mymodule` and PDB will load first, giving you the opportunity to install breakpoints. Most people will alter the code because it's simpler.

Jason R. Coombs 2010-09-28 11:45:57

I didn't find anything in the docs about randomness of sets so I guess it is implementation-specific / undefined behaviour. The best guess that I have so far is that it behaves as-if it were hashing the source although I don't think it does do that. The calls into the random module from set do show up under the profiler.

Amoss 2010-09-28 11:50:29

The detection of the bug comes after execution finishes (by me manually looking at some output). Do you know of a way to record a trace of a program execution so that the debugging can be done offline afterwards?

Amoss 2010-09-28 13:10:03

+2 A:

Raymond Hettinger has written a recipe for ordered sets. Its Big-O running times for all methods are the same as for sets. You might want to try changing all your sets over to ordersets.

unutbu 2010-09-28 11:40:05

ansaurus

tags:

views:

answers:

How do I track down a Heisenbug in some Python code?

related questions