I have an Animal base class that I use in some simulation programs. In this case there may be up to 500 Animals in a run. Each run is set up so that each Animal does something every "time step", so I just loop through the list of animals and call DoTimeStep on each one until all the time steps for the run are done.

Each Animal has its own logger object that writes out data for each "time step" in the simulation, so each Animal has its own log file. It worked fine for 3 years, until we tried to run it on a virtual machine. Then, for whatever reason, every once in a while the logger reference is null for a "time step", and on the next time step it is there again. The really strange part is that the StreamWriter inside the logger never seems to lose track of where its file is; it just skips writing the line for that time step. The error log shows a NullReferenceException on the Logger class.

I cannot find any pattern in this behavior. The Animal object was not destroyed and recreated. The logger is created in the Animal constructor and disposed in Dispose (IDisposable). Any ideas on how I would start to debug this issue?

Edit: I can recreate this with only 3 animals, so the 500 open files should not be it. But thanks for trying.

Edit: I am not sure what I am supposed to do when I catch the NullReferenceException. I was already catching it, but I cannot figure out how to find out why it is happening. Sorry for seeming obtuse. As an aside, I did try Thread.Sleep(300) for 10000 loops to see if there was some sort of race going on that I was unaware of. The reference never became non-null in the loop, but 3 seconds later, when I had cycled through the other two animals and come back, it was no longer null.

A: 

Is there a chance you're running into multithreading issues, possibly related to lazy initialization?

Hank Gay
+2  A: 

Sounds like a race condition... Are you locking data shared between threads properly?
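To illustrate what "locking properly" could look like here, below is a minimal sketch. The `AnimalLogger` class and its `TryWriteLine` method are hypothetical stand-ins, since the question does not show the real logger; the assumption is that some other thread (a timer, a finalizer, a Dispose call) may touch the writer while a time step is logging.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the per-animal logger described in the question.
public sealed class AnimalLogger : IDisposable
{
    private readonly object _sync = new object();
    private TextWriter _writer;

    public AnimalLogger(TextWriter writer)
    {
        if (writer == null) throw new ArgumentNullException("writer");
        _writer = writer;
    }

    // Writes one line under the lock. Returns false (instead of throwing)
    // if the logger has already been disposed.
    public bool TryWriteLine(string line)
    {
        lock (_sync)
        {
            if (_writer == null) return false;
            _writer.WriteLine(line);
            return true;
        }
    }

    // Takes the same lock, so a write can never observe a half-disposed writer.
    public void Dispose()
    {
        lock (_sync)
        {
            if (_writer != null)
            {
                _writer.Dispose();
                _writer = null;
            }
        }
    }
}
```

If the null only ever shows up from a single thread, a lock like this will not change anything, which is itself useful information when narrowing the problem down.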

EDIT: If it is not a race condition, my second guess would be that maybe the virtual machine doesn't like having 500 files open at the same time... Have you looked into that?

DrJokepu
Interesting point with the VM not allowing 500 files open.
siz
+5  A: 

I would do the following.

  1. Set up VS to run your program under the debugger
  2. Enable First Chance Exceptions for a Null Reference Exception
  3. Debug -> Exceptions -> Expand Common Language Runtime Exceptions -> Check Thrown for System.NullReferenceException
  4. Start the program
  5. Wait

If it takes a long time to repro I would start it and go home for the night. It will be waiting for you in the morning ;)
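If attaching a debugger on the VM is not practical, a similar first-chance notification can be captured in code. The sketch below assumes .NET Framework 4 or later, where `AppDomain.CurrentDomain.FirstChanceException` is available; the `FirstChanceLogging` class and the `sink` callback are hypothetical names for illustration.

```csharp
using System;
using System.Runtime.ExceptionServices;

public static class FirstChanceLogging
{
    // Call once at startup, before the simulation loop begins.
    // 'sink' receives a text report; it must not throw, because an exception
    // escaping a first-chance handler raises the event again.
    public static void Enable(Action<string> sink)
    {
        AppDomain.CurrentDomain.FirstChanceException +=
            (object sender, FirstChanceExceptionEventArgs e) =>
            {
                if (e.Exception is NullReferenceException)
                {
                    // Environment.StackTrace is the call stack at the moment
                    // of the throw, before any catch block has run.
                    sink("First-chance NullReferenceException:"
                         + Environment.NewLine + Environment.StackTrace);
                }
            };
    }
}
```

A sink that appends to an in-memory list or a dedicated file would do. The point is to record the call stack at the throw site, which a catch block further up the stack does not always make obvious.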

JaredPar
A: 

Sounds like a race condition. The easiest patch fix I've come up with to solve this kind of problem is:

Find where the null variable is being assigned. Whatever it's getting its data from is occasionally supplying null, right?

So check right there, if it returns null, sleep for like 300ms then try again until it's non-null.

If it fails for more than, say 10 seconds, bail out with an error. Don't let it continue in an invalid state.
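A rough sketch of that retry-then-bail-out idea follows. The `Retry` class, `WaitForNonNull`, and its parameters are hypothetical names; the 300 ms and 10 second figures from the text would be passed in as `pollInterval` and `timeout`.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Retry
{
    // Polls getValue until it returns non-null, sleeping pollInterval between
    // attempts. Throws TimeoutException once timeout has elapsed, so the
    // caller never continues in an invalid state.
    public static T WaitForNonNull<T>(Func<T> getValue, TimeSpan pollInterval, TimeSpan timeout)
        where T : class
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        while (true)
        {
            T value = getValue();
            if (value != null) return value;

            if (stopwatch.Elapsed >= timeout)
                throw new TimeoutException("Value was still null after " + timeout);

            Thread.Sleep(pollInterval);
        }
    }
}
```

For example, `Retry.WaitForNonNull(() => source.GetLogger(), TimeSpan.FromMilliseconds(300), TimeSpan.FromSeconds(10))`, where `source.GetLogger()` stands for whatever supplies the reference. This only papers over the symptom; it does not explain where the null comes from.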

Bill K
A: 

It could be something related to the VM... I have seen similar memory issues with VMware with native C++ applications where for no reason at all the memory gets messed up, and IT says they weren't doing anything to the VMs... (and we all know that means they were doing something to the VM)

I have seen weird behavior when the VM is actively being moved from one server to another while not being powered down (it basically repeats a few bytes of code twice which causes all kinds of interesting errors, especially if the resources were closed first).

But anyway, if it is only reproducible on the VM, I would try to isolate it on a dev VM server by itself using dedicated resources (no sharing or fractional processors or anything) and see if you can reproduce it. Then go from there, adding back pieces of the original environment until you can reproduce it.

uzbones