I've written an Apache module in C. Under certain conditions, I can get it to segfault, but I have no idea as to why. At this point, it could be my code, it could be the way I'm compiling the program, or it could be a bug in the OS library (the segfault happens during a call to dlopen()).

I've tried running it through GDB and Valgrind with no success. GDB gives me a backtrace into the dlopen() call that appears meaningless. In Valgrind, the bug actually seems to disappear or at least become non-reproducible. On the other hand, I'm a total novice when it comes to these tools.

I'm a little new to production-quality C programming (I started on C many years ago, but have never worked professionally with it). What is the best way for me to go about learning the ropes of debugging programs? What other tools should I be investigating? In summary, how do you figure out how to tackle new bug challenges?

EDIT: Just to clarify, I want to thank Sydius and dmckee for their input. I had taken a look at Apache's guide and am fairly familiar with dlopen (and dlsym and dlclose). My module works for the most part (it's at about 3k lines of code and, as long as I don't activate this one section, things seem to work just fine).

I guess this is where my original question comes from - I don't know what to do next. I know I haven't used GDB and Valgrind to their full potential. I know that I may not be compiling with the exact right flags. But I'm having trouble figuring out more. I can find beginner's guides that tell me what I already know, and man pages that tell me more than I need to know but with no guidance.

+5  A: 

The Apache Debugging Guide may help with your specific problem. Experience with specific problems is one of the best ways to get better in the general case.

Sydius
+1  A: 

Very general advice:

  • Look again at that backtrace. Are any of the stack frames in code you control? If so, what line, and what is happening there?

  • Do you know what dlopen() does? If not, read the manual. If the backtrace does not include any of your code, this may well be failing at the time Apache tries to load your code. Are you sure you've built the module with the right compiler options?

  • Effective debugging requires knowing your environment and tools. Sydius's advice is good here.

  • If you're stuck on other paths, check that you can write, load, and run a trivial module. Probably you'll find an example of this in almost any documentation on the subject.


To dave's clarification: Between beginner and expert can be a tough spot.

Are you calling in libraries in the offending code that you don't use elsewhere? Maybe the loader path is messed up just for that resource.
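One way to test the loader-path suspicion is to have the process report what it actually sees. The sketch below is only an illustration: `loader_path_is_set` and `report_loader_env` are hypothetical helper names, and it assumes a POSIX system where the search path of interest is the LD_LIBRARY_PATH environment variable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if LD_LIBRARY_PATH is set to a non-empty value in this
 * process, 0 otherwise. */
static int loader_path_is_set(void)
{
    const char *p = getenv("LD_LIBRARY_PATH");
    return p != NULL && p[0] != '\0';
}

/* Hypothetical helper: writes one line with the effective user id and
 * the loader path as seen by the running process. */
static void report_loader_env(FILE *out)
{
    const char *p = getenv("LD_LIBRARY_PATH");
    fprintf(out, "euid=%ld LD_LIBRARY_PATH=%s\n",
            (long)geteuid(),
            loader_path_is_set() ? p : "(unset or empty)");
}
```

Calling `report_loader_env(stderr)` just before the offending code runs, and again from an interactive shell session, gives two lines to compare. A difference between them would support the idea that the server process resolves libraries differently from your own login.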

Aside from that I'm just about out of advice. Sorry.


NB: I had occasion to read David J. Agans' book Debugging last year. It is not software specific, but is a good read, and helpful even if you are already a pretty good debugger.

dmckee
I am anything but a good debugger :) I'll take a look.
dave mankoff
+2  A: 

I'm sure that debugging techniques are in general language independent and that there is no such thing as "C debugging".
There are many tools that can help you find simple problems such as memory leaks or plain mistakes in the code; sometimes they can even catch simple memory overruns. But for really hard-to-find problems, such as those originating from multitasking, interrupts, or DMA memory corruption, the only tools are your brain and well-written code (written with the expectation that it will have to be debugged). You can find more about preparing your code for debugging here. It seems from Sydius's post that Apache already has a good tracing mechanism in place, so use it and add something similar to your own code base.
In addition, I would say that another important step in debugging is "don't assume". Base every step on bare facts, and prove each assumption with 100% certainty before taking another step that depends on it. Debugging based on unverified assumptions will usually lead you in the wrong direction.
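As a rough illustration of preparing code for debugging, here is a minimal tracing sketch. It is not Apache's tracing mechanism: `MOD_TRACE`, `TRACE`, `trace_at`, `trace_count`, and `load_plugin_step` are all hypothetical names, and the output simply goes to stderr.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical switch: build with -DMOD_TRACE=0 to compile tracing out. */
#ifndef MOD_TRACE
#define MOD_TRACE 1
#endif

static unsigned long trace_count = 0;  /* number of trace lines emitted */

/* Print one trace line tagged with the source location it came from. */
static void trace_at(const char *file, int line, const char *func,
                     const char *fmt, ...)
{
    va_list ap;

    fprintf(stderr, "[trace] %s:%d %s(): ", file, line, func);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    fflush(stderr);  /* precaution so the last line is out before a crash */
    trace_count++;
}

#if MOD_TRACE
#define TRACE(...) trace_at(__FILE__, __LINE__, __func__, __VA_ARGS__)
#else
#define TRACE(...) ((void)0)
#endif

/* Hypothetical stand-in for one step of the suspect section. */
static int load_plugin_step(const char *name)
{
    TRACE("about to load '%s'", name);
    /* ... the real work would go here ... */
    TRACE("finished '%s'", name);
    return 0;
}
```

The last trace line printed before a crash is a fact rather than an assumption about how far execution got, which fits the "don't assume" advice above.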

Edit after Dave's clarification:

Your next step should be to find the smallest part of the code that causes the problem. You said that if you disable a certain section, the module loads. Make that section as small as possible: remove or mock everything in it until you find, ideally, the one line that causes the module not to load. Once you have found that line, it will be exactly the time to start using your brain :) Just don't forget to verify 100% that this really is the line.

Ilya
+3  A: 

Unfortunately the GNU tools are not the best, and my experience is that the dynamic linker muddies the waters enormously. If you can get Apache to link statically with your module that will enable gdb especially to perform more reliably. I don't know how easy that is; a lot depends on the Apache build system.

It's worrisome but not shocking that you can't easily reproduce the bug with valgrind.

Regarding compiling with the right flags, both valgrind and gdb will give you much better information if you compile everything in sight with -g -O0. Don't believe the claims on the gcc man page that gcc -g -O is good enough; it isn't. Even -O will cause variables in the source code to be eliminated by the optimizer.

Norman Ramsey
Correction: While static linking is good for gdb, it is bad for valgrind. Valgrind documentation wants glibc to be dynamically linked. Sorry about that.
Norman Ramsey
gcc -g -O will annotate the (perhaps confusingly optimized) code with debug info. If the bug is somehow dependent on undefined behavior exploited by the optimizer, this is what you want. -O0 will, of course, be easier to follow.
puetzk
Yes, but code produced by gcc -g -O is not faithful to the original program. The worst problem is that variables disappear. You can still track line numbers, but from experience, gcc -g -O is much less useful than you might imagine from reading the man page (and the question asked about options).
Norman Ramsey
A: 

I had a look at the valgrind documentation and by default it doesn't check child processes. It wouldn't surprise me at all if Apache ran your module in a child process. Please try

valgrind --trace-children=yes ....
Norman Ramsey
A: 

To our non-CS students (i.e. electrical engineering, math, and physics students) I recommend, in the programming lectures, "The Practice of Programming" by Kernighan and Pike. It conveys well some basic concepts that aid development (like testing and, here it comes, debugging).

If you are already an experienced programmer, it may be too basic for you. Then I have just one more Zen proverb for you: "Wisdom without the filtering through experience is worthless".

One answer I can only back up: look again at the stack trace. It is the most relevant aid in debugging, especially at the borders where execution crosses between modules (above all between your code and the library/OS). Look at the arguments of each function there and check that they are sane.
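To make "check that the arguments are sane" concrete at the dlopen() boundary, a check along the following lines could be called right before the real call. This is a sketch under assumptions: `dlopen_args_look_sane` is a hypothetical helper, and its rules (exactly one of RTLD_LAZY or RTLD_NOW, a non-empty name terminated within PATH_MAX bytes) are heuristics for spotting garbage, not the loader's own validation.

```c
#include <dlfcn.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096  /* fallback if the platform does not define it */
#endif

/* Hypothetical helper: returns 1 if the arguments about to be passed
 * to dlopen() look plausible, 0 if they look suspicious. */
static int dlopen_args_look_sane(const char *path, int flags)
{
    int has_lazy = (flags & RTLD_LAZY) != 0;
    int has_now  = (flags & RTLD_NOW) != 0;

    /* The flags are expected to carry one binding mode; neither or
     * both is treated as suspicious here. */
    if (has_lazy == has_now)
        return 0;

    /* A NULL name is legal: it asks for a handle to the main program. */
    if (path == NULL)
        return 1;

    /* An empty name is unlikely to be what the caller intended. */
    if (path[0] == '\0')
        return 0;

    /* No terminator within PATH_MAX bytes suggests a corrupted string. */
    if (strnlen(path, PATH_MAX) == PATH_MAX)
        return 0;

    return 1;
}
```

If the check fails in the crashing configuration only, the problem is probably upstream of dlopen(), in whatever builds the path or flags.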

flolo
+1  A: 

The fact that it is failing on the dlopen() call seems a bit suspect to me. There are a number of things that can go wrong when attempting to open a shared object; but none of them should cause a seg fault.

The one exception I can think of is a problem in the library initialization of the SO. On that basis, I would suggest a few things you could try to get more information.

  • Check your library path, and ensure that the library you're trying to load is in this path. Note that since you're using Apache, I think you also need to check the library path for the user under which Apache is running (I think the user is "nobody"). I believe you're looking for the LD_LIBRARY_PATH environment variable. Also note that if you have multiple versions of the library, this can be really important: make sure you're loading the correct version of the library.
  • As a general debugging principle, try to simplify the problem. Given that I know little about Apache modules, I would try to remove Apache from the equation: Try writing a simple C program that does little more than a dlopen() and possibly the subsequent dlsym(), then exits. This program provides a much simpler environment to troubleshoot and/or debug. If this program runs cleanly, then you may need to look more closely at what's different when the program seg faults. (What's Apache doing differently?) On the other hand, if your program also seg faults, you may consider a potential problem with the library, your compiler switches for the program, and the code in the program. (Or all of the above.)
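The stripped-down test program described in the last point could be built around a function like the one below. It is a minimal sketch: `try_load` is a hypothetical name, the return codes are arbitrary, and the choice of RTLD_NOW is an assumption made so that unresolved symbols show up at load time rather than later.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Try to load `path` and, if `symbol` is non-NULL, resolve it.
 * Returns 0 on success, 1 if dlopen() failed, 2 if dlsym() failed. */
static int try_load(const char *path, const char *symbol)
{
    void *handle;
    const char *err;

    fprintf(stderr, "calling dlopen(\"%s\")\n", path);
    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        err = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", err ? err : "(no message)");
        return 1;
    }

    if (symbol != NULL) {
        dlerror();                    /* clear any stale error state */
        (void)dlsym(handle, symbol);  /* only the lookup result matters */
        err = dlerror();
        if (err != NULL) {
            fprintf(stderr, "dlsym failed: %s\n", err);
            dlclose(handle);
            return 2;
        }
    }

    dlclose(handle);
    return 0;
}
```

A main() that passes argv[1] and argv[2] to `try_load` and returns its result turns this into a standalone program; on some systems it has to be linked with -ldl. If the program crashes inside dlopen() on its own, Apache is out of the picture and the library's load-time initialization becomes the main suspect.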

While I may not have offered very many general purpose debugging tips, I hope something here may have been helpful.

Jedidiah Thomet