We have a program that uses both Boost's matrix and sparse matrix libraries, and we are attempting to integrate Boost threads. However, when we move from a single-threaded to a multi-threaded application, we get segmentation faults that do not occur in the single-threaded case.

We've debugged using gdb (in Eclipse), and I've found that the seg faults occur during function calls in the Boost code. For example, when I access an entry of a sparse matrix, the stack trace goes into Boost code and dies at some point (not always the same point) in those files.

I'm confused because these matrices are allocated within the individual threads and all shared resources are protected by mutex locks. Furthermore, I thought threading bugs usually showed up as conflicting accesses and bad data rather than seg faults. However, I'm obviously not experienced with multi-threaded programming, so I was hoping that someone with more experience in this area might be able to provide some advice.

I'm using the boost managed make system and an example compilation command is

g++ -DNDEBUG -I"/usr/local/include/boost-1_38" -I/usr/local/include -I"/home/scandido/workspace/BigSHOT/src" -I"/home/scandido/workspace/BigSHOT/src/base" -O3 -Wall -c -fmessage-length=0 `freetype-config --cflags` -pthread -MMD -MP -MF"src/utils/timing_info.d" -MT"src/utils/timing_info.d" -o"src/utils/timing_info.o" "../src/utils/timing_info.cpp"

and the linker command is

g++ -L"/usr/local/lib" -L/usr/local/lib -o"BigSHOT"  ./src/utils/timing_info.o ... many more objects ... ./src/base/pomdp/policy_fn/EventDriven.o ./src/base/pomdp/policy_fn/Greedy.o  ./src/anotheralgorithm.o   -lboost_serialization-gcc43-mt -lpthread -lboost_thread-gcc43-mt -lboost_program_options-gcc43-mt -lboost_iostreams-gcc43-mt -lpng -lpngwriter -lz -lfreetype

Here is a stack trace for the thread that seg faults:

Thread [5] (Suspended: Signal 'SIGSEGV' received. Description: Segmentation fault.) 
    17 boost::numeric::ublas::mapped_matrix<bool, boost::numeric::ublas::basic_row_major<unsigned long, long>, boost::numeric::ublas::map_std<unsigned long, bool, std::allocator<std::pair<unsigned long const, bool> > > >::operator() /usr/local/include/boost-1_38/boost/numeric/ublas/matrix_sparse.hpp:377 0x000000000041c328 
    16 BigSHOT::Fire1FireState::get_cell() /home/scandido/workspace/BigSHOT/src/systems/fire1/pomdp/Fire1State.cpp:51 0x0000000000419a75 
    15 BigSHOT::Fire1SquareRegionProbObsFn::operator() /home/scandido/workspace/BigSHOT/src/systems/fire1/obs_fn/Fire1SquareRegionProbObsFn.cpp:92 0x000000000042ac37 
    14 BigSHOT::Fire1SquareRegionProbObsFn::operator() /home/scandido/workspace/BigSHOT/src/systems/fire1/obs_fn/Fire1SquareRegionProbObsFn.cpp:66 0x000000000042a8bf 
    13 BigSHOT::BayesFilterFn<BigSHOT::Fire1Belief, BigSHOT::Fire1State, BigSHOT::Fire1Action, BigSHOT::Fire1Observation>::update() /home/scandido/workspace/BigSHOT/src/base/pomdp/filter_fn/BayesFilterFn.h:50 0x0000000000445c3b 
    12 BigSHOT::HyperParticleFilter<BigSHOT::Fire1Belief, BigSHOT::Fire1Action, BigSHOT::Fire1Observation>::future_evolution() /home/scandido/workspace/BigSHOT/src/base/hpf/HyperParticleFilter.h:127 0x00000000004308e0 
    11 BigSHOT::HyperParticleFilter<BigSHOT::Fire1Belief, BigSHOT::Fire1Action, BigSHOT::Fire1Observation>::hyperfilter() /home/scandido/workspace/BigSHOT/src/base/hpf/HyperParticleFilter.h:86 0x000000000043149b 
    10 BigSHOT::HyperParticleFilterSystem<BigSHOT::HyperCostFn<BigSHOT::Fire1Belief, BigSHOT::Fire1Action>, BigSHOT::PolicyFn<BigSHOT::Fire1Belief, BigSHOT::Fire1Action>, BigSHOT::Fire1Belief, BigSHOT::Fire1Action, BigSHOT::Fire1Observation>::next_stage() /home/scandido/workspace/BigSHOT/src/base/hpf/HyperParticleFilter.h:189 0x0000000000446180 
    9 hyperfilter() /home/scandido/workspace/BigSHOT/src/anotheralgorithm.cpp:126 0x0000000000437798 
    8 hf_thread_wrapper() /home/scandido/workspace/BigSHOT/src/anotheralgorithm.cpp:281 0x0000000000437cd9 
    7 boost::_bi::list1<boost::_bi::value<int> >::operator()<void (*)(int), boost::_bi::list0>() /usr/local/include/boost-1_38/boost/bind.hpp:232 0x000000000043f25a 
    6 boost::_bi::bind_t<void, void (*)(int), boost::_bi::list1<boost::_bi::value<int> > >::operator() /usr/local/include/boost-1_38/boost/bind/bind_template.hpp:20 0x000000000043f298 
    5 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(int), boost::_bi::list1<boost::_bi::value<int> > > >::run() /usr/local/include/boost-1_38/boost/thread/detail/thread.hpp:56 0x000000000043f2b6 
    4 thread_proxy()  0x00007f28241c893f 
    3 start_thread()  0x00007f28243d93ba 
    2 clone()  0x00007f2822a4ffcd 
    1 <symbol is not available> 0x0000000000000000

The line of code where it happens is the second of this block:

const_reference operator () (size_type i, size_type j) const {
  const size_type element = layout_type::element (i, size1_, j, size2_);
  const_subiterator_type it (data ().find (element));

I'd like to reiterate that the seg fault doesn't always occur at the same place in the code, but always when executing something in the boost code.

Thanks in advance for your help!

+2  A: 

Segfaults can appear to arise from other, well-debugged libraries (or even from the standard library!) if you corrupt the heap or the free store, for example by double-freeing (or double-deleting) a pointer, accessing a pointer that was already freed (or deleted), freeing (or deleting) a pointer that was not allocated, using delete where you should have used delete[] or vice-versa, etc.

This crash will often happen in a completely different place and at a totally different time from when and where the error occurred. If you have variables other than the matrices shared among multiple threads, and you have a race condition that, for example, causes you to double-delete the shared variable, this could corrupt the free store and later cause a crash inside of the boost matrix code.

You should run your code through a tool like valgrind to try to track down the heap/free store corruption.
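
For comparison, here is a minimal sketch of the "copy out under a lock, work on a private copy, write back under a lock" discipline that avoids this kind of race. It assumes C++11 std::thread and std::mutex rather than the Boost.Thread types from the question, and Shared, worker, and run_workers are hypothetical names made up for illustration, not anything from the original program.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared object: the only state visible to more than one thread.
struct Shared {
    std::mutex mtx;          // guards every access to input and total
    std::vector<int> input;  // data the workers copy out at startup
    long total = 0;          // result the workers write back at the end
};

// Each worker touches the shared object only while holding the lock.
void worker(Shared& shared, int id) {
    std::vector<int> local;
    {
        std::lock_guard<std::mutex> lock(shared.mtx);
        local = shared.input;  // deep copy; the thread now owns its own data
    }

    // All real work happens on thread-private data, with no lock held.
    long partial = 0;
    for (int v : local) {
        partial += static_cast<long>(v) * id;
    }

    {
        std::lock_guard<std::mutex> lock(shared.mtx);
        shared.total += partial;  // write the result back under the lock
    }
}

// Starts num_threads workers with ids 1..num_threads and returns the total.
long run_workers(int num_threads) {
    assert(num_threads > 0);
    Shared shared;
    shared.input = {1, 2, 3, 4};

    std::vector<std::thread> threads;
    for (int id = 1; id <= num_threads; ++id) {
        threads.emplace_back(worker, std::ref(shared), id);
    }
    for (auto& t : threads) {
        t.join();  // all workers finish before shared goes out of scope
    }
    return shared.total;
}
```

If any access to the shared object slips outside those locked scopes, or a thread outlives the object it refers to, you are back in undefined-behavior territory, and that is the sort of thing valgrind can help you find.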

Tyler McHenry
Thanks for the suggestions. We are using boost's shared_ptr (smart pointer) so I'm fairly confident that it isn't an issue of deleting memory incorrectly. (There are no delete statements in the threaded code.) We only have one object shared among threads and we only access it at the beginning and end of threads to copy data out and write data back in. We are using locks to prevent simultaneous access between threads. Is there something I'm not considering here?
scandido
I've never used valgrind before. Can you suggest a good tutorial for getting started aside from the documentation on their site?
scandido
valgrind is pretty simple to use; it's understanding the output that's sometimes a challenge. All you really need to do to get started running a program like this is compile with debugging symbols and then run: valgrind --tool=memcheck ./yourapp. For each memory access error, it will report where it occurred, what sort of error it was, and then give you one or two abbreviated stack traces that show you where the incompatible operations occurred. Unfortunately as far as I know there is no better resource than their official docs. Take a look at the memcheck section, since that's what you need.
Tyler McHenry
It seems you are correct. After removing all the bad stuff that valgrind brought to my attention, the seg fault seems to have disappeared. Thanks for your help!
scandido