views:

198

answers:

4

How can I wait for a detached thread to finish in C++?

I don't care about an exit status, I just want to know whether or not the thread has finished.

I'm trying to provide a synchronous wrapper around an asynchronous thirdarty tool. The problem is a weird race condition crash involving a callback. The progression is:

  1. I call the thirdparty, and register a callback
  2. when the thirdparty finishes, it notifies me using the callback -- in a detached thread I have no real control over.
  3. I want the thread from (1) to wait until (2) is called.

I want to wrap this in a mechanism that provides a blocking call. So far, I have:

class Wait {
  public:
  void callback() {
    pthread_mutex_lock(&m_mutex);
    m_done = true;
    pthread_cond_broadcast(&m_cond);
    pthread_mutex_unlock(&m_mutex);
  }

  void wait() {
    pthread_mutex_lock(&m_mutex);
    while (!m_done) {
      pthread_cond_wait(&m_cond, &m_mutex);
    }
    pthread_mutex_unlock(&m_mutex);
  }

  private:
  pthread_mutex_t m_mutex;
  pthread_cond_t  m_cond;
  bool            m_done;
};

// elsewhere...
Wait waiter;
thirdparty_utility(&waiter);
waiter.wait();

As far as I can tell, this should work, and it usually does, but sometimes it crashes. As far as I can determine from the corefile, my guess as to the problem is this:

  1. When the callback broadcasts the end of m_done, the wait thread wakes up
  2. The wait thread is now done here, and Wait is destroyed. All of Wait's members are destroyed, including the mutex and cond.
  3. The callback thread tries to continue from the broadcast point, but is now using memory that's been released, which results in memory corruption.
  4. When the callback thread tries to return (above the level of my poor callback method), the program crashes (usually with a SIGSEGV, but I've seen SIGILL a couple of times).

I've tried a lot of different mechanisms to try to fix this, but none of them solve the problem. I still see occasional crashes.

EDIT: More details:

This is part of a massively multithreaded application, so creating a static Wait isn't practical.

I ran a test, creating Wait on the heap, and deliberately leaking the memory (i.e. the Wait objects are never deallocated), and that resulted in no crashes. So I'm sure it's a problem of Wait being deallocated too soon.

I've also tried a test with a sleep(5) after the unlock in wait, and that also produced no crashes. I hate to rely on a kludge like that though.

EDIT: ThirdParty details:

I didn't think this was relevant at first, but the more I think about it, the more I think it's the real problem:

The thirdparty stuff I mentioned, and why I have no control over the thread: this is using CORBA.

So, it's possible that CORBA is holding onto a reference to my object longer than intended.

+2  A: 

Yes, I believe that what you're describing is happening (race condition on deallocate). One quick way to fix this is to create a static instance of Wait, one that won't get destroyed. This will work as long as you don't need to have more than one waiter at the same time.

You will also permanently use that memory, it will not deallocate. But it doesn't look like that's too bad.

The main issue is that it's hard to coordinate lifetimes of your thread communication constructs between threads: you will always need at least one leftover communication construct to communicate when it is safe to destroy (at least in languages without garbage collection, like C++).

EDIT: See comments for some ideas about refcounting with a global mutex.

Adam Goode
Unfortunately, this is in a massively multithreaded app, and we really want separate Wait objects for each - otherwise it slows us down too much.
Tim
Also, if we use a static Wait, there is the problem of trying to coordinate which thread needs to resume.
Tim
Ok, you can do this. You can add a refcount field to the Wait object, protected by a global mutex. Start the refcount at 2, and then have the callback and the waiter both decrement the refcount when done. If the global mutex becomes your bottleneck, there are other more complicated solutions.
Adam Goode
Also, you don't need the global mutex if you can use atomic operations for the refcount.
Adam Goode
Excellent! The refcounting is actually provided by CORBA, but that solved the problem! Thanks!
Tim
A: 

To the best of my knowledge there's no portable way to directly ask a thread if its done running (i.e. no pthread_ function). What you are doing is the right way to do it, at least as far as having a condition that you signal. If you are seeing crashes that you are sure are due to the Wait object is being deallocated when the thread that creates it quits (and not some other subtle locking issue -- all too common), the issue is that you need to make sure the Wait isn't being deallocated, by managing from a thread other than the one that does the notification. Put it in global memory or dynamically allocate it and share it with that thread. Most simply don't have the thread being waited on own the memory for the Wait, have the thread doing the waiting own it.

quark
A: 

Are you initializing and destroying the mutex and condition var properly?

Wait::Wait()
{
    pthread_mutex_init(&m_mutex, NULL);
    pthread_cond_init(&m_cond, NULL);
    m_done = false;
}

Wait::~Wait()
{
    assert(m_done);
    pthread_mutex_destroy(&m_mutex);
    pthread_cond_destroy(&m_cond);
}

Make sure that you aren't prematurely destroying the Wait object -- if it gets destroyed in one thread while the other thread still needs it, you'll get a race condition that will likely result in a segfault. I'd recommend making it a global static variable that gets constructed on program initialization (before main()) and gets destroyed on program exit.

Adam Rosenfield
yes, the mutex and cond are initialized/destroyed properly. I'm actually using wrapper classes on those that have been well tested. And yes, I'm certain that Wait is being prematurely destroyed -- while one thread is still in Wait::callback.
Tim
A: 

If your assumption is correct then third party module appears to be buggy and you need to come up with some kind of hack to make your application work.

Static Wait is not feasible. How about Wait pool (it even may grow on demand)? Is you application using thread pool to run? Although there will still be a chance that same Wait will be reused while third party module is still using it. But you can minimize such chance by properly queing vacant Waits in your pool.

Disclaimer: I am in no way an expert in thread safety, so consider this post as a suggestion from a layman.

BostonLogan