views:

359

answers:

7

Hello,

I have detected a memory corruption in my embedded environment (my program is running on a set top box with a proprietary OS ). but I couldn't get the root cause of it. the memory corruption , itself, is detected after a stress test of launching and exiting an application multiple times. giving that I couldn't set a memory break point because the corruptued variable is changing it's address every time that the application is launched, is there any idea to catch the root cause of this corruption?

(A memory break point is break point launched when the environment change the value of a giving memory address)

note also that all my software is developed using C language.

Thanks for your help.

A: 

You detect memory corruption. Could you be more specific how? Is it a crash with a core dump, for example?

Normally the OS will completely free all resources and handles your program has when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although its not a given.

So an intermittent problem could seem to be triggered after stress but just be chance, or could be in the initialisation of drivers or other processes the program communicates with, or could be bad error handling around say memory allocations that fail when the OS itself is under stress e.g. lazy tidying up of the closed programs.

Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence -style custom allocator might help if its as simple as a buffer overflow.

Will
In fact the memory corruption is detected in a structure . So what I 'm doing is watching this structure when it 's created and filled in. In this step every thing seems to be good. but when a another process look to this structure it get a corrupted data that 's why I have concluded that there is another process writing in this location . I mention also that the application is running over an interpreter and even if the application crash the interpreter manage it very well. notice also that the problem is systematic so I don't think that 's chance.
Mohamed
C, interpreted, and shared memory? I think you need to add a whole lot more detail to your original question!
Will
+2  A: 

These are always difficult problems on embedded systems and there is no easy answer. Some tips:

  • Look at the value the memory gets corrupted with. This can give a clear hint.
  • Look at datastructures next to your memory corruption.
  • See if there is a pattern in the memory corruption. Is it always at a similar address?
  • See if you can set up the memory breakpoint at run-time.
  • Does the embedded system allow memory areas to be sandboxed? Set-up sandboxes to safeguard your data memory.

Good luck!

Matthijs P
A: 

Use memory-allocation debugging tools like ElectricFence, dmalloc, etc - at minimum they can catch simple errors and most moderately-complex ones (overruns, underruns, even in some cases write (or read) after free), etc. My personal favorite is dmalloc.

jesup
+2  A: 

Yes these problems can be tough to track down with a debugger.

A few ideas:

  • Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
  • If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
  • If your OS is a cooperative multitasking e.g. round robin (this would be too hard I think for preemptive multitasking): Add code to the end of the task that "owns" the structure, to save a "check" of the structure. That check could be a memcpy (if you have the time and space), or a CRC. Then after every other task runs, add some code to verify the structure compared to the saved check. This will detect any changes.
Craig McQueen
+2  A: 

Where is the data stored and how is it accessed by the two processes involved?

If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations which has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted then this indicates that the corruption is being caused by something which has some knowledge of your structure's location rather than a random memory stomp.

If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.

You can also test this by switching data from the heap into a data section or visa versa.

If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.

You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.

To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.

If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.

Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that inspect the code generated. Similarly consider whether it could be due to a faulty context switch in your proprietary RTOS.

Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.

ukembedded
A: 

A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).

The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.

Mark Bessey
A: 

I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.

I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.

Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).

If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.

Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).

bta