We have been debugging a strange case for some days now and have somewhat isolated the bug, but it still doesn't make any sense. Perhaps someone here can give me a clue about what is going on.

The problem is an access violation that occurs in one part of the code.

Basically we have something like this:

void aclass::somefunc() {
  try {
    erroneous_member_function(*someptr);
  } 
  catch (AnException) {
  }
}

void aclass::erroneous_member_function(const SomeObject& ref) {
  // { //<--scope here error goes away
  LargeObject obj = Singleton()->Object.someLargeObj; //<-remove this error goes away

  // DummyDestruct dummy1; // <-- this is not destroyed before the unreachable object

  throw AnException();

  // } //<--end scope here error goes away 

  UnreachableClass unreachable; //<- remove this, and the error goes away

  DummyDestruct dummy2; //<- destructor of this object is called! 
}

In the debugger it actually looks like it is destructing the UnreachableClass object, and when I insert the DummyDestruct object, it does not get destroyed before the strange destructor is called. So it does not seem like the destruction of the LargeObject is going awry.

All this is in the middle of production code, and it is very hard to isolate it to a small example.

My question is, does anyone have a clue about what is causing this, and what is happening? I have a quite full featured debugger available (Embarcadero RAD studio), but now I am not sure what to do with it.

Can anyone give me some advice on how to proceed?

Update:

I placed a DummyDestruct object beneath the throw statement and placed a breakpoint in its destructor. The destructor for this object is entered (and its only use is in this piece of code).
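For comparison, here is a minimal sketch of what the language rules should produce in this situation. It is not the production code: Tracer, throwing_function and the log are hypothetical stand-ins invented for illustration, and AnException here is just an empty struct. An object declared after the throw is never constructed, so its destructor must never run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the types in the question (illustration only).
static std::vector<std::string> g_log;  // records constructor/destructor calls in order

struct Tracer {
    std::string name;
    explicit Tracer(const std::string& n) : name(n) { g_log.push_back("ctor " + name); }
    ~Tracer() { g_log.push_back("dtor " + name); }
};

struct AnException {};

void throwing_function() {
    Tracer before("before");  // constructed, so it is destroyed during unwinding
    throw AnException();
    Tracer after("after");    // never reached: no constructor, no destructor
}

// Runs the scenario and returns the recorded call sequence.
std::vector<std::string> run_scenario() {
    g_log.clear();
    try {
        throwing_function();
    } catch (const AnException&) {
        g_log.push_back("caught");
    }
    return g_log;
}

// Asserts the sequence a conforming compiler should produce.
void check_scenario() {
    std::vector<std::string> log = run_scenario();
    assert(log.size() == 3);
    assert(log[0] == "ctor before");
    assert(log[1] == "dtor before");
    assert(log[2] == "caught");
}
```

Calling check_scenario() from main on a conforming build should pass silently; any entry mentioning "after" in the log would match the behaviour described above.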

+1  A: 

With the information you have provided, and if everything is as you state, the only possible answer is a bug in the compiler/optimizer. Just add the extra scope with a comment (that is, again, if everything is exactly as you have stated).

David Rodríguez - dribeas
I have thought of the same thing, but I cannot accept this. If such a gross error were in the compiler, it would have been noticed. (We are not the only company in the world using BCC, right?) The situation is as described, but as I say, this is in a large project, and many things could have gone wrong earlier. There could be memory overwrites of all sorts, and the LargeObject being copied might be completely invalid or overwritten.
daramarak
Is there any possibility that the Singleton() memory is what gets overwritten or messed with? Is anyone extending the Singleton or something?
phtrivier
@daramarak: *We are not the only company in the world using BCC right?* I've never seen or heard of BCC being used (here in Europe). It's either MSVS or GCC.
just somebody
@just somebody: I knew it! We are the only one in the world using it, that explains quite a lot :)
daramarak
@phtrivier: The singleton function accesses a static variable inside the function, and the object cannot be extended, as it can only be constructed by this function. So I cannot see how this is possible.
daramarak
Well, it turns out that you are right. After reducing this to a 20-line program we are able to reproduce the bug. It also turns out that turning the -Od option off during the build removes this *feature*. It has now been reported to Embarcadero as a bug.
daramarak
+1  A: 

Stuff like this sometimes happens due to writing through uninitialized pointers, out-of-bounds array access, etc. The point at which the error is caused may be quite removed from the place where it manifests. However, based on the symptoms you describe, it seems to be localized in this function. Could the copy constructor of LargeObject be misbehaving? Is ref being used? Perhaps someptr isn't pointing to a valid SomeObject. Is Singleton() returning a pointer to a valid object? A compiler error is also a possibility, especially with aggressive optimization turned on. I would try to recreate the bug with no optimizations.

Ari
LargeObject does not have any constructors or destructors other than those provided by the compiler. The singleton returns a copy of a valid object (but it might be corrupted). But I cannot see how the corruption could do anything wrong during this piece of code, especially when it seems like the object doesn't even get created, only placed on the stack.
daramarak
Is `->` overloaded for the type returned by `Singleton()`? If it is, it might be causing the trouble.
Ari
What about the constructor of `AnException`? BTW, is this legal code? Based on the capital letters I'm guessing `AnException` is a class name not an object.
Ari
Ari: Yes, AnException is the constructor; that was a typo. I fixed it. And yes, the constructor of the exception may be the culprit. I have looked at the code. It is not very complex, but not trivial either. I might have another look at it.
daramarak
Have you tried Ari's suggestion to recreate with optimization turned off?
Mutmansky
It was compiled with -Od, and that is actually needed to see the bug. Building with -O2, or with no -O option at all, actually removes or, much worse, hides the bug.
daramarak
+1  A: 

Time to practice my telepathic debugging skills:

My best guess is your application has a stack corruption bug. This can write junk over the call stack, which means the debugger is incorrectly reporting the function when you break, and it's not really in the destructor. Either that or you are incorrectly interpreting the debugger's information and the object really is being destructed correctly, but you don't know why!

If stack corruption is the case, you're going to have a really tough time working out what the root cause is. This is why it's important to implement tonnes of diagnostics (e.g. asserts) throughout your program so you can catch the stack corruption when it happens, rather than getting stuck on its weird side effects.
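As a rough illustration of that kind of diagnostic, here is a minimal sketch, assuming a hypothetical GuardedBuffer helper (not something from the question's code base). It surrounds a buffer with known marker words, so that an overrun can be caught by an assert at the next checkpoint instead of showing up much later as a strange crash.

```cpp
#include <cassert>
#include <cstddef>

// Arbitrary marker value; any pattern unlikely to occur by accident would do.
const unsigned long kGuardPattern = 0xDEADBEEFul;

// Hypothetical helper: a buffer with a guard word on each side.
// Members of a struct are laid out in declaration order, so an overrun
// of `data` in either direction lands on one of the guard words.
template <std::size_t N>
struct GuardedBuffer {
    unsigned long front;
    char data[N];
    unsigned long back;

    GuardedBuffer() : front(kGuardPattern), back(kGuardPattern) {}

    // True while both guard words still hold the marker value.
    bool intact() const {
        return front == kGuardPattern && back == kGuardPattern;
    }
};

// Call at checkpoints (e.g. before and after a suspicious call).
template <std::size_t N>
void check_guard(const GuardedBuffer<N>& buf) {
    assert(buf.intact() && "guard word overwritten: buffer overrun nearby");
}
```

Sprinkling check_guard calls between suspicious statements narrows down where the overwrite happens. It only detects writes that actually hit the guard words, so it is a cheap first net rather than a complete answer.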

AshleysBrain
Stack corruption is the hypothesis I have been following. I have been inspecting the stack before and after the exception, looking for throwing destructors in the objects created, and even tracing the code through the CPU view to find out whether something is misbehaving. But are you suggesting that the whole debugger is deceiving me when I step through the code, and that the functions I see being called don't really get called at all?
daramarak
No, I wouldn't go that far... all I'm saying is if stack corruption has already happened, the debugger might show incorrect information in the call stack.
AshleysBrain
I am not relying only on the call stack. I have been stepping through the code, so I think I can at least trust what gets run and what doesn't.
daramarak
My problem is this: if the stack is corrupted, it must be corrupted somewhere between the start of the try scope and the throw. Any other corruption should not interfere with the stack unwinding. If the stack gets corrupted, it should be caused by the code within that scope. This is a call to the singleton and some copy constructors, right? The copy constructors are provided by the compiler, and the singleton call is trivial. Then we have the destructors, which don't even get called if the stack unwinds in the way I think it does (last declared, first destroyed). Or are my assumptions wrong?
daramarak
I don't know - telepathic debugging is kind of hard. I think it's unlikely you'll solve your problem like this, since "weird, unexplainable stuff is happening, why?" is a hard question to answer. I'd focus on the last part of my answer: "This is why it's important to implement tonnes of diagnostics (e.g. asserts) throughout your program so you can catch the stack corruption when it happens, rather than getting stuck on its weird side effects."
AshleysBrain
I agree that telepathic debugging is hard. Kudos to you for trying. And stack corruption is weird; I have come across it many times. But this time it doesn't seem to add up. I thought that the scope of the error was bounded by the try clause and the throw. But somehow it isn't.
daramarak
A: 

This might be a real long shot but I'm going to put it out there anyway...

You say you use Borland - what version? And you say you see the error in a string - STL? Do you include winsock2 at all in your project?

The reason I ask is that I had a problem when using Borland 6 (2002) and winsock - the header seemed to mess up the structure packing, which meant that different translation units had a different idea of the memory layout of std::string, depending on which headers each translation unit included, with predictably disastrous results.
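To show what a packing mismatch does, here is a small sketch. The two structs are hypothetical and stand in for "the same type seen under different packing settings"; it assumes a compiler that supports #pragma pack(push/pop) and a platform where int is aligned to more than one byte.

```cpp
#include <cassert>
#include <cstddef>

// Layout as seen by a translation unit using the default packing.
struct DefaultLayout {
    char tag;
    int  value;  // usually preceded by padding so it is suitably aligned
};

// The same members as seen by a translation unit where a header left
// 1-byte packing switched on.
#pragma pack(push, 1)
struct PackedLayout {
    char tag;
    int  value;  // no padding: starts right after `tag`
};
#pragma pack(pop)

// True when the two views disagree about size or member offsets.
bool layouts_differ() {
    return sizeof(DefaultLayout) != sizeof(PackedLayout)
        || offsetof(DefaultLayout, value) != offsetof(PackedLayout, value);
}
```

If code compiled with one view passes an object to code compiled with the other, each side reads and writes members at different offsets, which is the sort of corruption described above.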

markh44
It would not surprise me if something like this was the problem. But I am sorry to say that winsock2 is not used in our code.
daramarak
A: 

Here's another wild guess, since you mentioned strings. I know of at least one implementation where (STL) string copying is done in a lazy manner (i.e., no actual copying of the string contents takes place until a change is made; the "copying" is done by simply having the target string object point to the same buffer as the source). In that particular implementation (GNU) there is a bug whereby excessive copying causes the reference counter (how many objects are using the same actual string memory after supposedly copying it) to roll over to 0, resulting in all sorts of mischief. I haven't encountered this bug myself, but have been told about it by someone who has. (I say this because one would think that the ref counter would be a 32 bit number and the chances of that ever rolling over are pretty slim, to say the least, so I may not be describing the problem properly.)
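To make the rollover idea concrete, here is a toy sketch. It is not the GNU implementation: TinyRefCount is a hypothetical counter that is deliberately only 8 bits wide (assuming 8-bit unsigned char) so the wrap-around is easy to see.

```cpp
#include <cassert>

// Hypothetical, deliberately tiny reference counter (illustration only).
struct TinyRefCount {
    unsigned char count;  // holds 0..255 with 8-bit chars

    TinyRefCount() : count(1) {}             // the original owner

    void add_ref() { ++count; }              // wraps around modulo 256
    bool release() { return --count == 0; }  // true means "free the buffer"
};

// Returns the counter value after `copies` lazy copies of one string.
unsigned simulate_copies(unsigned copies) {
    TinyRefCount rc;
    for (unsigned i = 0; i < copies; ++i) {
        rc.add_ref();
    }
    return rc.count;
}
```

With this toy counter, simulate_copies(255) returns 0 even though 256 owners still share the buffer, so later releases would behave unpredictably. A wider counter only moves the threshold further out.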

Ari
It doesn't sound like this is our problem. Especially since the string deallocated seems to be in an unused portion of the stack above all stack objects that have been allocated in the scope.
daramarak