I have a bug in a multi-process program. The program receives input and instantly produces output; no networking is involved, and it doesn't depend on the time in any way. What makes this bug hard to track down is that it only happens sometimes.

If I run it repeatedly, it produces both correct and incorrect output, with no discernible order or pattern.

What can cause such non-deterministic behavior? Are there tools out there that can help? It is possible that uninitialized variables are in play. How do I find those?

EDIT: Problem solved, thanks to everyone who suggested a race condition. I didn't think of it, mainly because I was sure that my design prevented this. The problem was that I used 'wait' instead of 'waitpid', so sometimes, when some process happened to finish before the one I was expecting, the correct order of things went wild.

+2  A: 

It could be a lot of things: memory leaks, unsynchronized access to critical sections, unclosed resources, unclosed connections, etc. There is only one tool that can help you: a debugger. Otherwise, examine your algorithm to find the bug, or, if you manage to pinpoint the problematic part, paste a snippet here and we will try to help you.

Artem Barger
Good list of suspects, not so good listing only one tool. Consider also valgrind, error-checking malloc replacements (like efence), etc.
dmckee
+1  A: 

We'd need to see specifics about your code to give a more accurate answer, but to be concise: when you have a program that coordinates between multiple processes or multiple threads, the timing of when each thread executes can add indeterminacy to your application. Essentially, the OS scheduler can run processes and threads in a different order each time. Depending on your environment and code, this can cause wildly different results. You can search Google for out-of-order execution with multithreading for more information; it's a large topic.

McWafflestix
+5  A: 

You say it's a "multi-processes" program - could you be more specific? It may very well be a race condition in how you're handling the multiple processes.

If you could tell us more about how the processes interact, we might be able to come up with some possibilities. Note that although Artem's suggestion of using a debugger is fine in and of itself, you need to be aware that introducing a debugger may very well change the situation completely - particularly when it comes to race conditions. Personally I'm a fan of logging a lot, but even that can change the timing subtly.

Jon Skeet
A: 

By "multi-process" do you mean multi-threaded? Suppose we had two threads that both run this routine:

int i = 1;
while (1)
{
    printf("%d", i++);
    if (i > 4) i = 1;
}

Normally we'd expect the output to be something like

112233441122334411223344

But actually we'd be seeing something like

11232344112233441231423

This is because each thread gets to use the CPU at different rates. (There's a lot of complexity behind how the scheduler works, and I'm not the right person to explain the technical details.) Suffice it to say that, from the average person's point of view, scheduling is pretty random.

This is an example of the race condition mentioned in other answers.

kizzx2
+3  A: 

The scheduler!

Basically, when you have multiple processes, they can run in any bizarre order they want. If those processes share a resource that they both read from and write to (whether it be a file, memory or an I/O device of some sort), operations are going to get interleaved in all sorts of weird orders. As a simple example, suppose you have two threads (they're threads, so they share memory) and they're both trying to increment a global variable, x:

y = x + 1;
x = y;

Now run the two threads, interleaving the code like this:

Assume x = 1 to start.

P1:

y = x + 1

So now in P1, y (which is local and on the stack) is 2. Then the scheduler steps in and starts P2:

P2:

y = x + 1
x = y

x was still 1 coming into this, so 1 has been added to it and now x = 2.

Then P1 finishes:

P1:

x = y

and x is still 2! We incremented x twice, but it only went up once. And because we can't know in advance how the operations will be interleaved, this is referred to as non-deterministic behavior.

The good news is, you've stumbled upon one of the hardest problems in systems programming, as well as the primary battle cry of many functional-language folks.

Alex Gartrell
Good example, thanks.
Liran Orevi
+3  A: 

You're most likely looking at a race condition, i.e. an unpredictable and therefore hard to reproduce and debug interaction between improperly synchronized threads or processes.

The non-determinism in this case stems from process/thread and memory access scheduling. This is unpredictable because it is influenced by a large number of external factors, including network traffic and user input which constantly cause interrupts and lead to different actual sequences of execution in the program's threads each time it's run.

Michael Borgwardt
+1  A: 

Start with the basics: make sure that all your variables have a default value and that all dynamic memory is zeroed out before you use it (i.e. use calloc rather than malloc). There should be a compiler option to flag uninitialized variables, such as -Wuninitialized in GCC and Clang (unless you're using some obscure compiler).

If this is C++ (I know this is supposed to be a C forum), there are times when object creation and initialization lag behind variable assignment, and that can bite you. For example, if you have a variable that is used concurrently by multiple threads (as in a singleton or a global variable), this can cause issues:

if (!foo) foo = new Foo();

If multiple threads run the code above, the first thread may find foo == NULL, start the object creation and assignment, and then yield. Another thread comes in and finds foo != NULL (the pointer can already be assigned even though the object is not fully constructed yet), so it skips the section and starts to use foo.

Jato