Hi, I have a server-client program in which both the server and the client are multithreaded. The number of servers and clients varies (for example, 3 servers (replicas) and 10 clients). I am debugging a source file in this program. I think there is some kind of deadlock, possibly the following:

A mutex lock is already held by a server method and a request from the client invokes a server method which wants to acquire the mutex again.

The program is launched by a test script which spawns the servers and clients and makes the clients send specific requests to the servers. I have used the following code in the suspicious area to see if there is a deadlock, but it doesn't seem to work, i.e. the code enters neither block:

if (pthread_mutex_lock(&a_mutex) == EDEADLK) {
    cout<<"couldnt acquire lock."<<endl;
}
else cout<<"acquired lock"<<endl;
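For reference, here is a minimal sketch of when `pthread_mutex_lock` actually returns `EDEADLK`. POSIX only guarantees that error for a mutex created with the `PTHREAD_MUTEX_ERRORCHECK` type. With a default mutex, a second lock attempt by the same thread typically just blocks, which would explain why neither branch above is reached. The helper names `init_errorcheck_mutex` and `relock_result` are hypothetical and only for illustration. The check catches a thread relocking a mutex it already owns, not a wait on a mutex held by a different thread.

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical helper: initialise `m` as an error-checking mutex.
// Returns 0 on success, otherwise a pthread error code.
int init_errorcheck_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Hypothetical helper: lock `m` twice from the calling thread and
// return the result of the second attempt. Assumes `m` is an
// error-checking mutex, so the second attempt fails instead of blocking.
int relock_result(pthread_mutex_t* m) {
    int first = pthread_mutex_lock(m);
    if (first != 0) return first;
    int second = pthread_mutex_lock(m);  // same thread already owns m
    pthread_mutex_unlock(m);             // release the lock taken above
    return second;
}
```

If `a_mutex` were initialised through something like `init_errorcheck_mutex(&a_mutex)`, the `== EDEADLK` comparison above could fire when the owning thread tries to lock it again.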

I tried to debug with gdb by attaching to one running server process. I added "display" and "watch" (in different runs of gdb) for a_mutex. I get a result of the following form:

1: a_mutex = {__data = {__lock = 2, __count = 0, __owner = 4193, __kind = 0, __nusers = 2, 
{__spins = 0, __list = {__next = 0x0}}}, 
  __size = "\002\000\000\000\000\000\000\000a\020\000\000\000\000\000\000\002\000\000  \000\000\000\000", __align = 2}

I don't know the meaning of everything in the above output, but I could see that a thread (4193) is holding the mutex. I looked at the backtrace of that thread (snipped):

#0  0xb8082430 in __kernel_vsyscall ()
#1  0xb7e347a6 in nanosleep () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e345be in sleep () from /lib/tls/i686/cmov/libc.so.6
#3  0x0804cb59 in class1::method1 (this=0xbfa9fe6c, clt=1, id=
    {static npos = 4294967295, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xb7c9c11c "l/%\b"}})
at file1.cc:33

I don't know where the bug is or how it arises.

I would highly appreciate any help with the following questions:

  1. What is a good method of debugging such conditions/programs?
  2. How do I detect the deadlock condition (i.e. where a lock is being held and not released)?
  3. In such a multi-process program, is there a better way of using gdb (e.g. inspecting state in all processes, or configuring gdb to watch/display a variable before the process starts)?
  4. When I attach gdb to the server after it has been started by the tester script, the server might already have advanced past the code I want to inspect. I tried adding a sleep(20) before the suspicious area to give me time to attach gdb, but I don't think this is a good way. I also think that opening multiple terminals, starting the servers and clients manually, and checking the state of each of them is not a very good idea (please correct me if I am wrong).

PS: I have read this question already.

Thank you very much.

+3  A: 

Attach GDB to the hung program, then run "thread apply all bt" (I think; I don't have a system handy to check).

It'll give you a backtrace of all of the threads and you should be able to see which thread is doing what.

If this problem is easily reproducible, you can also use strace to give you some info on which locks are being taken.

Mike