ansaurus

Question

Answer 1

+1 A:

That sounds like a stack buffer overflow: something is writing beyond the bounds of an array and trampling over the stack frame (and probably the return address too). There's a large literature on the subject. "The Shell Programmer's Guide" (2nd Edition) has SPARC examples that may help you.

Jonathan Leffler 2008-10-30 22:58:51

Answer 2

A:

Is something meaning to assign a value of 2 to a variable but instead is assigning its address to 2?

The other details are lost on me but "2" is the recurring theme in your problem description. ;)

John at CashCommons 2008-10-30 22:59:39

Answer 3

+3 A:

I had that exact problem today and was knee-deep in gdb mud, debugging for a straight hour, before it occurred to me that I had simply written past the bounds of a C array (where I least expected it).

So, if possible, use vectors instead, because any decent STL implementation will give good diagnostic messages if you try that in debug mode (whereas C arrays punish you with segfaults).
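As a minimal sketch of the checked access that vectors offer: `std::vector::at()` validates the index and throws `std::out_of_range`, while `operator[]` is only checked in some debug-mode library builds. The helper name `write_is_rejected` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: tries to write '!' at `index` and reports whether
// the bounds check rejected the write.
bool write_is_rejected(std::vector<char>& buf, std::size_t index) {
    try {
        buf.at(index) = '!';  // at() checks the index against size()
        return false;         // the index was in range, the write happened
    } catch (const std::out_of_range&) {
        return true;          // out-of-range write was refused, nothing trampled
    }
}
```

With a ten-element vector, index 9 is accepted and index 10 is rejected instead of silently writing past the end.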

Konrad Rudolph 2008-10-30 23:02:31

Answer 4

A:

I would second that this definitely sounds like stack corruption due to out-of-bounds array or buffer writes. A stack protector would help as long as the writes are sequential, not random.

Franci Penov 2008-10-30 23:06:41

Answer 5

+2 A:

I'm not sure what you're calling a "frame pointer", as you say:

On actual execution of that instruction, we end up at program counter 0x000002

Which makes it sound like the return address is being corrupted. The frame pointer is a pointer that points to the location on the stack of the current function call's context. It may well point to the return address (this is an implementation detail), but the frame pointer itself is not the return address.

I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:

Incorrect calling convention. If you're calling a function using a calling convention different from how the function was compiled, the stack may become corrupted.
RAM hit. Anything writing through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations have the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ultimately deals with the pointer on a different thread. Unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems possible that one thread is sending a signal to another. Maybe one of the parameters in that signal holds a pointer to a local?
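To illustrate the pointer-to-a-local hazard without real threads, here is a rough sketch in which a queue of deferred callbacks stands in for work that another thread would run later. `DeferredWork` and `queue_message` are hypothetical names, not anything from the asker's code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for work that some other thread picks up later.
using DeferredWork = std::vector<std::function<std::string()>>;

// Queues a callback that owns its own copy of the payload.
void queue_message(DeferredWork& queue) {
    std::string local = "signal payload";
    // Capturing by value copies `local` into the closure. Capturing by
    // reference ([&local]) would leave the closure pointing into this
    // function's dead stack frame once we return.
    queue.push_back([local] { return local; });
}
```

Running the queued callback after `queue_message` has returned still yields the payload, because nothing in the closure refers back to the stack frame that created it.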

Michael Burr 2008-10-30 23:09:10

Answer 6

A:

I second the notion that it is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspect that a lurking bug has been exposed. Possibly the buffer overflow was previously landing on unused memory, and now it's hitting another thread's stack. There are many other possible scenarios.

Sorry if that doesn't give much of a hint at how to find it.

Steve Fallows 2008-10-30 23:21:22

Answer 7

+9 A:

Stack corruption, 99.9% definitely.

The smells you should be looking carefully for are:-

Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
thread-safety of anything using pointers
Uninitialised POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference
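For the strcpy-style item, one possible bounded replacement is sketched below. It uses `std::snprintf`, which never writes more than the size it is given; the wrapper name `copy_bounded` is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical wrapper: copies `src` into `dst` without ever writing more
// than `dst_size` bytes. Returns false if the text had to be truncated.
bool copy_bounded(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && src != nullptr);
    if (dst_size == 0) {
        return false;  // no room even for the terminating NUL
    }
    // snprintf writes at most dst_size bytes including the NUL and returns
    // the length the full string would have needed.
    int needed = std::snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && static_cast<std::size_t>(needed) < dst_size;
}
```

A caller that gets `false` back knows the input did not fit, which is exactly the case where a plain `strcpy` would have run off the end of the array.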

Roddy 2008-10-30 23:36:39

Answer 8

+1 A:

With C++, uninitialized variables and race conditions are likely suspects for intermittent crashes.

postfuturist 2008-10-30 23:38:31

Answer 9

+1 A:

Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (Actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and such.

If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.

Zan Lynx 2008-10-31 00:03:56

Answer 10

A:

I tried Valgrind on it, but unfortunately it doesn't detect stack errors:

"In addition to the performance penalty an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data."

I tend to agree that this is a stack overflow problem. The tricky thing is tracking it down. Like I said, there's over 100,000 lines of code to this thing (including custom libraries developed in-house - some of it going as far back as 1992) so if anyone has any good tricks for catching that sort of thing, I'd be grateful. There's arrays being worked on all over the place and the app uses OI for its GUI (if you haven't heard of OI, be grateful) so just looking for a logical fallacy is a mammoth task and my time is short.

Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.

No one asked this, but I built with gcc-4.2. Also, I can guarantee ABI safety here so that's also not the issue. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that as garbage tends to be random.

2008-10-31 00:49:40

Answer 11

+1 A:

It's not hard to mangle the frame pointer - if you look at the disassembly of a routine you will see that it is pushed at the start of a routine and pulled at the end - so if anything overwrites the stack it can get lost. The stack pointer is where the stack is currently at - and the frame pointer is where it started at (for the current routine).

Firstly I would verify that all of the libraries and related objects have been rebuilt clean and all of the compiler options are consistent - I've had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.

It sounds exactly like an overwrite - and putting guard blocks around memory isn't going to help if it is simply a bad offset.

After each core dump examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember the frame pointer is the last stack pointer - so anything logically before the frame pointer shouldn't be modified in the current stack frame - so maybe record this and copy it elsewhere and compare upon return.
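The "copy it elsewhere and compare upon return" idea could look roughly like this. It is a sketch only: `RegionSnapshot` is a made-up helper, and which bytes are worth snapshotting (a neighbouring local, a saved pointer) is something you would have to decide per routine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: remembers the bytes of a memory region so a later
// check can tell whether anything overwrote them in the meantime.
class RegionSnapshot {
public:
    RegionSnapshot(const void* region, std::size_t size)
        : region_(static_cast<const unsigned char*>(region)),
          copy_(region_, region_ + size) {
        assert(region != nullptr);
    }

    // True if the live region still matches the bytes saved at construction.
    bool unchanged() const {
        return copy_.empty() ||
               std::memcmp(region_, copy_.data(), copy_.size()) == 0;
    }

private:
    const unsigned char* region_;      // the live memory being watched
    std::vector<unsigned char> copy_;  // the saved copy, kept on the heap
};
```

Take the snapshot on entry to a suspect routine and check `unchanged()` just before returning; a failed check narrows the overwrite down to that call.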

Richard Harrison 2008-10-31 01:30:50

Answer 12

+3 A:

There's some confusion here between stack overflow and stack corruption.

Stack Overflow is a very specific issue caused by trying to use more stack than the operating system has allocated to your thread. The three normal causes are like this.

void foo()
{
  foo();  // endless recursion - whoops!
}

void foo2()
{
  char myBuffer[A_VERY_BIG_NUMBER];  // The stack can't hold that much.
}

class bigObj
{
  char myBuffer[A_VERY_BIG_NUMBER];  
};

void foo2( bigObj big1)  // pass by value of a big object - whoops!
{
}

In embedded systems, thread stack size may be measured in bytes and even a simple calling sequence can cause problems. By default on Windows, each thread gets 1 MB of stack, so stack overflow is a much less common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, even though this is usually NOT the best answer.
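Rather than raising the stack size, the second and third cases can usually be avoided by keeping the big storage on the heap and passing the object by const reference. A rough sketch, where `kVeryBigNumber` is an assumed stand-in for A_VERY_BIG_NUMBER and the other names are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed size, standing in for A_VERY_BIG_NUMBER in the examples above.
const std::size_t kVeryBigNumber = 8 * 1024 * 1024;

struct BigObj {
    std::vector<char> buffer;  // the bytes live on the heap, not in the object
    BigObj() : buffer(kVeryBigNumber) {}
};

// Passing by const reference pushes only a pointer-sized argument.
std::size_t use_big(const BigObj& big) {
    return big.buffer.size();
}

std::size_t make_and_use() {
    BigObj big;  // only the small vector bookkeeping sits in this stack frame
    return use_big(big);
}
```

The object itself is now a few dozen bytes on typical implementations, so neither declaring it locally nor handing it to another function puts megabytes on the stack.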

Stack Corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data - or return addresses on the stack.

At its simplest:-

void foo()
{ 
  char message[10];

  message[10] = '!';  // whoops! beyond end of array
}

Roddy 2008-10-31 17:16:21

Answer 13

A:

Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.

If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread returned from a function.

You might be able to repro this by forcing the app onto a single processor. I don't know how you do that with Sparc.

MSN 2008-10-31 20:52:57

Answer 14

A:

It is impossible to know, but here are some hints that I can come up with.

In pthreads every thread other than the main one gets a fixed-size stack, either set through the thread attributes or left at the default. Did you allocate enough? There is no automatic stack growth like in a single-threaded process.
If you are sure that you don't corrupt the stack by writing past stack-allocated data, check for rogue pointers (mostly uninitialized pointers).
One of the threads could overwrite some data that others depend on (check your data synchronisation).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifests itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to its own LWP to ease the debugging.
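The entry/exit tracing mentioned above can be done with a small scope object, sketched here under the assumption that the log is any `std::ostream`. `ScopeTrace`, `traced_add` and `trace_demo` are illustrative names only.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical tracer: logs on construction (function entry) and on
// destruction (function exit, including early returns and exceptions).
class ScopeTrace {
public:
    ScopeTrace(std::ostream& out, const char* name) : out_(out), name_(name) {
        out_ << "enter " << name_ << '\n';
    }
    ~ScopeTrace() { out_ << "exit " << name_ << '\n'; }

    ScopeTrace(const ScopeTrace&) = delete;
    ScopeTrace& operator=(const ScopeTrace&) = delete;

private:
    std::ostream& out_;
    const char* name_;
};

// Example of an instrumented function.
int traced_add(std::ostream& log, int a, int b) {
    ScopeTrace trace(log, "traced_add");
    return a + b;
}

// Collects the trace of one call into a string for inspection.
std::string trace_demo() {
    std::ostringstream log;
    traced_add(log, 1, 2);
    return log.str();
}
```

An "enter" line with no matching "exit" in the log points at the call during which the return path was lost. In a multi-threaded build the stream would also need locking or a per-thread log, which this sketch leaves out.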

lothar 2009-04-13 01:17:24


What can modify the frame pointer?
