I am using a C++ library that is meant to be multi-threaded; the number of worker threads can be set through a variable. The library uses pthreads. The problem appears when I run the test application that ships with the library on a quad-core machine using 3 or more threads: the application exits with a segmentation fault. When I insert some tracing "cout"s in parts of the library, the problem goes away and the application finishes normally. On a single-core machine, the application finishes normally no matter how many threads are used.

How can I figure out where the problem comes from?

Is it some kind of synchronization error? How can I find it? Is there any tool I can use to check the code?

A: 

Some general debugging recommendations.

  1. Make sure your build has symbols (compile with -g). This option is orthogonal to other build options (i.e. the decision to build with symbols is independent of the optimization level).
  2. Once you have symbols, take a close look at the call stack of where the seg fault occurs. To do this, first make sure your environment is configured to generate core files (ulimit -c unlimited) and then after the crash, load the program/core in the debugger (gdb /path/to/prog /path/to/core). Once you know what part of your code is causing the crash, that should give you a better idea of what is going wrong.
R Samuel Klatchko
I don't know what a core is, and I don't know what /path/to/core should be. Could you explain in more detail, please?
Navid
Well, you will soon discover 'gdb' then :)
Matthieu M.
+2  A: 

Sounds like you're using Linux (you mention pthreads). Have you considered running valgrind?

Valgrind has tools for checking for data races (helgrind) and memory problems (memcheck). Valgrind may be able to find such an error in debug mode without needing to reproduce the crash that release mode produces.

Jeff Foster
I've used helgrind. It reports some errors as race conditions in sections of the code that run serially. It also reports some identical errors in the parallel parts, as shown below, but I don't understand what they mean.
--------------------------------------------
==28556== This conflicts with a previous read of size 4 by thread #6
==28556==    at 0x806A3ED: Population::GaSortedGroup::GetAt(int) const (SortedGroup.h:181)
..... (order of method calls)
Navid
A race condition is where you have two or more threads looking at the same bit of data at (potentially) the same time. Generally shared mutable data should be protected by a mutex so you can't (for example) get one thread reading data as another thread updates it. This can give you stale data and lead to problems.
Jeff Foster
A: 

You are running into a race condition, where multiple threads interact with the same resource.
There are a whole host of possible culprits, but without the source anything we say is a guess.

You want to create a core file, then debug the application with the core file. This sets the debugger up in the state the application was in at the point it crashed, which lets you examine the variables, registers, etc.

How to do this will vary depending on your system.

A quick Google revealed this:

http://www.codeguru.com/forum/archive/index.php/t-299035.html

Hope this helps.

Martin York