I have a server application running under CentOS. The server answers many requests per second, but it crashes roughly once an hour and creates a crash dump file each time. The situation is really bad and I need to find the cause of the crash as soon as possible.

I suspect a concurrency problem, but I'm not sure. I have access to the source code and the crash dump files, but I don't know how to use the crash dumps to pinpoint the problem.

Any suggestions are much appreciated.

A: 

Does your app create a core file? If so, I would use gdb to debug this problem.

David B
+2  A: 

If the problem takes an hour or so to manifest itself, it might be a memory problem - perhaps running out, or perhaps trampling (using already released memory, for example).

You say you've got the crash dump files - are they core dumps?

Assuming you have a core dump, then the first step should probably be to print the stack backtrace:

gdb program core
(gdb) where

This should tell you where the program was when the crash occurred. What else is available depends on how the server was compiled. If possible, you should recompile with debugging enabled (that would be with the '-g' flag with GCC). This would give you more information from the stack backtrace.

If your problem is memory related, consider running with valgrind.

Also consider building and running with a debugging version of malloc(). A debugging version will detect memory abuses that normal versions miss - or crash on.
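To illustrate what a debugging allocator does, here is a minimal sketch in C, assuming a simple guard-byte scheme. The names dbg_malloc, dbg_check and dbg_free are hypothetical and not taken from any real library; real debugging allocators do considerably more. The idea is to surround each block with known byte patterns and verify them later, so that a buffer overrun is reported close to where it happened instead of silently corrupting the heap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: [size + leading guard][user data][trailing guard]. */
#define FRONT 32    /* size field plus leading guard; multiple of 16 keeps alignment */
#define REAR  16    /* trailing guard bytes */
#define GUARD 0xAB  /* fill pattern for the guard regions */

/* Allocate n usable bytes surrounded by guard bytes. */
void *dbg_malloc(size_t n)
{
    if (n > SIZE_MAX - FRONT - REAR)
        return NULL;
    unsigned char *raw = malloc(FRONT + n + REAR);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &n, sizeof n);                        /* remember the requested size */
    memset(raw + sizeof n, GUARD, FRONT - sizeof n);  /* leading guard */
    memset(raw + FRONT + n, GUARD, REAR);             /* trailing guard */
    return raw + FRONT;
}

/* Return 0 if both guards are intact, -1 on underrun, 1 on overrun. */
int dbg_check(const void *p)
{
    assert(p != NULL);
    const unsigned char *raw = (const unsigned char *)p - FRONT;
    size_t n;
    memcpy(&n, raw, sizeof n);
    for (size_t i = sizeof n; i < FRONT; i++)
        if (raw[i] != GUARD)
            return -1;
    for (size_t i = 0; i < REAR; i++)
        if (raw[FRONT + n + i] != GUARD)
            return 1;
    return 0;
}

/* Verify the guards, abort on corruption, then release the block. */
void dbg_free(void *p)
{
    if (p == NULL)
        return;
    if (dbg_check(p) != 0)
        abort();  /* the resulting core dump points at the damaged block */
    free((unsigned char *)p - FRONT);
}
```

With a scheme like this, writing one byte past the end of a block makes dbg_check return 1, and dbg_free aborts at the moment the damaged block is released rather than at some unrelated later allocation.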

Jonathan Leffler
Thanks for your detailed answer. The server has been compiled with debugging information and creates a core dump when it crashes. The server starts many threads. While there are fewer than 200 threads, the server keeps working for a bit more than an hour, but when a rise in the number of requests pushes the number of threads in the pool to about 300, the server crashes much sooner. Could you please tell me how using a debugging version of malloc may help? Thanks
O. Askari
Jonathan Leffler
Regarding 'under 200 threads OK; over 300 crashes sooner': have you looked into whether the server allocates a fixed-size pool of some resource (maybe 200, maybe 256) and doesn't properly check for exhaustion of that resource? It might be a set of mutexes, or semaphores, or something else that is triggering a memory abuse. Did you write the server code? Should you look into limiting the number of threads at work?
Jonathan Leffler
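The fixed-size-pool failure mode described in the comment above can be sketched as follows. This is a hypothetical illustration in C, not code from the server in question: POOL_SIZE, slot_pool, pool_acquire and pool_release are invented names, and the capacity of 256 is an assumption. The point is that acquiring a slot must be able to fail, and callers must check for that failure instead of indexing past the end of the pool.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 256  /* assumed fixed capacity */

/* A fixed-size pool of slots; each slot could stand for a mutex, buffer, etc. */
typedef struct {
    int in_use[POOL_SIZE];
    size_t used;
} slot_pool;

/* Return a free slot index, or -1 when the pool is exhausted. */
int pool_acquire(slot_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            p->used++;
            return i;
        }
    }
    return -1;  /* exhausted: the caller must handle this case */
}

/* Release a slot; return -1 for an invalid or already free slot. */
int pool_release(slot_pool *p, int slot)
{
    assert(p != NULL);
    if (slot < 0 || slot >= POOL_SIZE || !p->in_use[slot])
        return -1;
    p->in_use[slot] = 0;
    p->used--;
    return 0;
}
```

This sketch is not thread-safe; in a multi-threaded server both functions would need to run under a lock. Code that ignores the -1 return value and uses it as an array index is the kind of bug that only shows up once the thread count passes the pool size.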
Unfortunately I didn't write the code, so I really don't know exactly what is going on in the server. I'm just responsible for stabilizing it, and unfortunately the person who wrote it isn't available. The resource exhaustion idea is a good one. I'll look for any resource that may cause this problem. I'll also try the debugging malloc solution. I hope it shows me the real cause of the crashes. Thanks a lot for your help
O. Askari
+2  A: 

gdb -c core.file exename

bt

Assuming exename was built with debugging symbols (and all of its dynamic dependencies are in the path), that will get you a backtrace. 'up' and 'down' will move you up and down the stack, and 'p varname' can be used to examine locals and parameters...

You could also try running it under valgrind:

valgrind --tool=memcheck --leak-check=full exename

dicroce
+1  A: 

The first thing to look for is the error message that you get when the program crashes. This will often tell you what kind of error occurred. For example, "segmentation fault" or "SIGSEGV" almost certainly means that your program has dereferenced a NULL or otherwise invalid pointer. If the program is written in C++, then the error message will often tell you the name of any uncaught exception.

If you aren't seeing the error message, then run the program from the command line, or redirect its output to a file.
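If the server is normally started by a script that hides the message, one option is a small launcher that reports how the process ended. The following is a minimal sketch in C under that assumption; run_and_report is a hypothetical helper name, and it uses only the standard POSIX calls fork, execvp and waitpid plus strsignal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv as a child process and report how it terminated.
 * Returns the terminating signal number, 0 if the child exited
 * on its own, or -1 if the child could not be started or waited for. */
int run_and_report(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        fprintf(stderr, "child killed by signal %d (%s)\n", sig, strsignal(sig));
        return sig;
    }
    if (WIFEXITED(status))
        fprintf(stderr, "child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```

A return value equal to SIGSEGV corresponds to the "segmentation fault" case described above; SIGABRT typically indicates a call to abort(), for example from a failed assertion or an uncaught C++ exception.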

In order for a core file to be really useful, you need to compile your program without optimisation and with debugging information. GCC needs the following options: -g -O0. (Make sure your build doesn't have any other -O options.)

Once you have the core file, then open it in gdb with:

gdb YOUR-APP COREFILE

Type where to see the point where the crash occurred. You are basically in a normal debugging session - you can examine variables, move up and down the stack, switch between threads, and so on.

If your program has crashed, then it's probably an invalid memory access - so you need to look for a pointer that is zero, or that points to bad-looking data. You might not find the problem in the innermost frame, where the crash occurred; you might have to move up the stack a few levels, into the calling functions, before you find it.

Good luck!

alex tingle