Recently, our big project began crashing on an unhandled division by zero. None of the recent code looks like a likely culprit, so it may be new data sets affecting old code. The problem is that the code base is pretty big and runs on an embedded device with no comfortable debug access (debugging is done with a lot of printf()s over a serial console; there is no gdb for the device, and even if there were, a binary compiled with debug symbols wouldn't fit).

The most viable way would likely be to find all the division operations (they are relatively infrequent), and analyze code surrounding each of them to see if any of the divisor variables was left unguarded.

The question is then either how to find all division operations in a big (~200 files, some big) C++ project, or, if you have a better idea of how to locate the error, please share it.

extra info: project runs on embedded ARM9, a small custom Linux distro, crosscompiled with Cygwin/Windows crosstools, IDE is Eclipse but there's also Cygwin with all the respective goodies. Thing is the project is very hardware-specific, and the crashes occur only when running at full capacity, all the essential interconnected modules active. Restricted "fault mode" where only bare bones are active doesn't create them.

+7  A: 

Finding all of the divisions shouldn't be hard with a custom grep search. You can easily distinguish that usage from other usages of the / and % character in C++.
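A plain regex tends to trip over comments and string literals, so a small scanner may be easier to trust than a grep pattern. The sketch below is only an illustration: `find_division_ops` is a hypothetical helper, not part of any library. It ignores raw string literals and digit separators, and it will still flag the slashes in `#include <dir/file.h>` lines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the 0-based columns of '/' and '%' characters in `line` that are
// not inside a comment, string literal, or character literal.
// `in_block_comment` carries the /* ... */ state from one line to the next.
std::vector<std::size_t> find_division_ops(const std::string& line,
                                           bool& in_block_comment) {
    std::vector<std::size_t> hits;
    char quote = 0;  // '"' or '\'' while inside a literal, 0 otherwise
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        const char next = (i + 1 < line.size()) ? line[i + 1] : '\0';
        if (in_block_comment) {
            if (c == '*' && next == '/') {
                in_block_comment = false;
                ++i;  // skip the closing '/'
            }
        } else if (quote != 0) {
            if (c == '\\') {
                ++i;  // skip the escaped character
            } else if (c == quote) {
                quote = 0;
            }
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '/' && next == '/') {
            break;  // the rest of the line is a comment
        } else if (c == '/' && next == '*') {
            in_block_comment = true;
            ++i;  // skip the '*'
        } else if (c == '/' || c == '%') {
            hits.push_back(i);  // also matches /= and %=
        }
    }
    return hits;
}
```

Feeding each source file through this line by line, with one `bool` per file for the block-comment state, gives a list of file/line/column candidates to review by hand.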

Also, if you know what you are dividing, you could globally overload the / and % operators to include an assertion that reports __FILE__ and __LINE__ (this works only for class types; C++ does not allow overloading operators for built-in types such as int). If you use a makefile, it shouldn't be hard to include the custom operator code in all the compiled files without touching the code.
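For built-in operands, a wrapper that call sites opt into can give the same file-and-line reporting. This is a minimal sketch, not a drop-in fix: `checked_div` and `CHECKED_DIV` are hypothetical names, and returning zero on a zero divisor is just one possible policy.

```cpp
#include <cstdio>

// Hypothetical helper: performs a / b, but if b is zero it reports the call
// site on stderr (the serial console on the device) and returns zero instead.
template <typename A, typename B>
auto checked_div(A a, B b, const char* file, int line) -> decltype(a / b) {
    using Result = decltype(a / b);
    if (b == B(0)) {
        std::fprintf(stderr, "division by zero at %s:%d\n", file, line);
        return Result(0);  // assumed fallback value; pick what suits the code
    }
    return a / b;
}

// Hypothetical macro that captures the call site automatically.
#define CHECKED_DIV(a, b) checked_div((a), (b), __FILE__, __LINE__)
```

A suspicious `total / count` would then become `CHECKED_DIV(total, count)`, and the next crash is replaced by a log line naming the file and line.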

Kornel Kisielewicz
gcc-xml doesn't dump function bodies.
Georg Fritzsche
Don’t forget modulus operations!
Konrad Rudolph
If he uses this technique he probably also should look for '%' (damn, Konrad beat me to it)
Webinator
All true, thanks for the comments.
Kornel Kisielewicz
What would be the regexp to search for all / but not /* nor */ nor // ?
SF.
@SF. `[^\*]\/[^\/\*]` or something like that, don't remember the grep syntax.
Kornel Kisielewicz
Another approach might be to search the binary for the division opcodes, and then work backwards to the source code using the map file.
Adrian McCarthy
Adrian's idea is even better, because it will not find divisions like `x / 42` (the optimizer transforms those into multiplications).
MSalters
+6  A: 

You should use this as an excuse to invest in improving the debuggability of your device, for both this problem and future issues. Even if you can't get live debugging, you should be able to find a way to generate and save core dumps for post-mortem debugging (pinpointing the source of any unhandled exception immediately).

Terry Mahaffey
These devices are often seriously constrained by hardware requirements. Improving the debuggability is not usually feasible.
David Thornley
Using a dump could help find one division bug. Examining all the divisions in the product might find more.
Adrian McCarthy
Sadly I never had much luck with emailing (e.g.) major mobile phone handset manufacturers and telling them "you seem not to have implemented the debug stub on your pre-release development boards, I assume it'll be fine if we down tools until you get around to it?". Turns out they never got around to it, handsets went to manufacture still without debugging.
Steve Jessop
If it's a major mobile phone handset manufacturer they should have a simulator, no?
Terry Mahaffey
A: 

The only way to find those conditions is the usual:

  1. try to reproduce the problem without looking at the source (as the bug already happened you should have info on the part of the program that is affected)
  2. if found, check the source for this point and fix it, otherwise:
    2.1. grep for each / not followed by a * or / (grep "/[^/*]", I think)
    2.2. find the conditions for which the code is executed and reproduce it
dbemerlin
+8  A: 

I think the most direct step would be to try to catch the unhandled exception and generate a dump, or printf stack information or similar.

Take a look at this question, or just search Google for information on exception catching in your particular environment.

By the way, I think that the division could happen as a result of a call to an external library, so it's not 100% sure that you'll find the culprit just by grepping your code.

David Alfonso
Not in an embedded environment, and C++ doesn't necessarily throw an exception on a divide by zero.
David Thornley
The exception may be trapped by Linux, in which case it is probably generating a SIGFPE signal or similar which can be caught
Hasturkun
The ARM architecture can support IEEE754 exceptions, see FPEXC register.
MSalters
+8  A: 

If I remember right, the ARM9 doesn't have hardware divide so it's going to be implemented in a function call the compiler makes whenever it has to perform a division.

See if your toolset implements the divide by zero handling in the same way as ARM's toolset does (it's likely that it does something at least similar). If so, you can install a handler that gets called when the problem occurs and you can printf() registers and stack so that you can determine where the problem is occurring. A possible similar alternative is that your small Linux distro is throwing a signal you can catch.

I'm not sure how you're getting your information that a divide by zero is occurring, but if it's because the runtime is spitting out a message to that effect, you always have the option of finding out where that is handled in the runtime, and replacing it with your own more informative message. However, I'd guess that there's a more 'architected' way to get your code to run (a signal handler or ARM's technique).

Michael Burr
+2  A: 

PC-Lint might help; it's like FindBugs for C++. It is a commercial product, but there is a 30-day money-back guarantee.

Paul
Platforms: Windows 95/NT, DOS, OS/2. Not Linux embedded.
MSalters
@MSalters: PC-Lint is a *source code* analysis tool. You don't have to run it on the embedded device.
Craig McQueen
+1  A: 

Use the -save-temps for gcc and find the relevant assembly for division in the generated .s file. If you're lucky it will be something fairly distinctive, possibly even a function call. If it's a function call you can use weak linking to override it with your own checked version. Otherwise locating the divisions in the assembly should give you a very good idea where they are in the C/C++ code and you can instrument them directly.
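As an illustration of the override idea, here is a sketch that assumes a GCC toolchain following the ARM run-time ABI (EABI), where the integer division helpers hand control to `__aeabi_idiv0` when the divisor is zero. That is an assumption about your toolchain: older ABIs use a different hook name, so check what the generated .s files actually call before relying on it. `__builtin_return_address` is a GCC builtin.

```cpp
#include <cstdio>

// Assumption: an EABI toolchain whose division helper reaches this hook on a
// zero divisor. A definition in the program is meant to take the place of
// the runtime library's default one.
extern "C" int __aeabi_idiv0(int return_value) {
    // Address this function will return to. Whether that lies in the code
    // that performed the division depends on how the helper reaches the hook
    // (tail branch or call), so verify it against the map file.
    void* caller = __builtin_return_address(0);
    std::fprintf(stderr, "integer division by zero, return address %p\n",
                 caller);
    // The ABI lets the hook return a value that is used as the quotient.
    return return_value;
}
```

The logged address can be mapped back to a source line with addr2line or the linker map file. Note that 64-bit divisions go through a separate hook (`__aeabi_ldiv0`) under the same ABI.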

Dan Olson
+2  A: 

Handle the exception.

Usually the exception will be handed a structure that contains the address that caused the exception and other information. You will probably have to become familiar with the microcontroller's datasheet or RTOS manual.

Robert
A: 

The exception already has the address of the offending divide-by-zero code. The CPU saves the register contents when an exception occurs, including the PC (program counter). Your OS should pass this information along (I assume that is how you know it is a divide by zero). Print the address and go look in your code. If you can print a stack trace, it would be even easier to solve.

Another option would be to check the differences in your version control software between the last known working version and the first non-working version. This should give you a limited change set within which to search for the problem.

Gerhard
+1  A: 

Usually you can modify or override the divide-by-zero exception handler if you have access to the exception handler routines. On ARM, division is done by a library routine, and there are mechanisms to inform user code when a divide by zero occurs.

See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html

I would suggest providing a __rt_raise() as described on the page above.

__rt_raise(2,2) will get called when the divide routine detects a divide by zero, so you can print the LR register and then use addr2line to cross-reference it against the source line.

alvin