Hi,

This is my follow-up to the previous post on memory management issues. These are the issues I know of:

1) data races (atomicity violations and data corruption)
2) ordering problems
3) misuse of locks, leading to deadlocks
4) heisenbugs

Are there any other issues with multithreading? How do I solve them?

+1  A: 

The four most common problems with threading are:

1. Deadlock
2. Livelock
3. Race conditions
4. Starvation

Eric
@Eric Thank you very much. But how do I solve these issues?
brett
All of these problems can be solved with semaphores (locks). You need to carefully understand what you are doing first. Try not to overdo it with threads; they are a tool to help your program do its work, not some magical trick you need to put all over the place.
Eric
+1  A: 

How to solve [issues with multithreading]?

A good way to "debug" MT applications is through logging. A good logging library with extensive filtering options makes it easier. Of course, logging itself influences the timing, so you can still have "heisenbugs", but it's much less likely than when you actually break into the debugger.

Prepare and plan for that. Include a good logging facility in your application from the start.
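
To make that concrete, here is a minimal sketch of such a logging facility, assuming POSIX threads (the function name and the thread-tag parameter are made up, not from any particular library): each line gets a timestamp and a thread tag, which is what later filtering depends on.

#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Serialize writers and stamp every line with a timestamp and a thread tag. */
void log_msg(const char *thread_tag, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);

    pthread_mutex_lock(&log_mutex);
    fprintf(stderr, "[%ld] [%s] ", (long)time(NULL), thread_tag);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    pthread_mutex_unlock(&log_mutex);

    va_end(ap);
}

A call like log_msg("worker-1", "acquired queue lock") then sorts cleanly by time and filters cleanly by thread.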

sbi
+2  A: 

Unfortunately there's no pill that automatically solves most/all threading issues. Even unit tests that work so well on single-threaded pieces of code may never detect an extremely subtle race condition.

One thing that will help is keeping the thread-interaction data encapsulated in objects. The smaller the interface/scope of the object, the easier it will be to detect errors in review (and possibly in testing, but race conditions can be a pain to detect in test cases). If you keep the interface simple, clients that use it will tend to be correct by default. By building up a bigger system from lots of smaller pieces (only a handful of which actually do thread interaction), you can go a long way towards averting threading errors in the first place.
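
A minimal sketch of that encapsulation idea, assuming POSIX threads (counter_t and its three functions are illustrative, not from the answer): the lock lives inside the object, and callers can only reach the shared data through a tiny interface that is already correct.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long value;               /* only touched while holding lock */
} counter_t;

void counter_init(counter_t *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->value = 0;
}

void counter_add(counter_t *c, long delta)
{
    pthread_mutex_lock(&c->lock);
    c->value += delta;
    pthread_mutex_unlock(&c->lock);
}

long counter_get(counter_t *c)
{
    pthread_mutex_lock(&c->lock);
    long v = c->value;        /* snapshot taken under the lock */
    pthread_mutex_unlock(&c->lock);
    return v;
}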

Mark B
A: 

Make your threads as simple as possible.

Try not to use global variables. Global constants (actual constants that never change) are fine. When you do need to use global or shared variables, you need to protect them with some type of mutex/lock (semaphore, monitor, ...).
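
For example (a sketch, assuming POSIX threads; the names are made up), a shared global paired with the mutex that protects it, where every access goes through the same lock:

#include <pthread.h>

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total = 0;   /* shared: access only while holding total_lock */

void add_to_total(long amount)
{
    pthread_mutex_lock(&total_lock);
    shared_total += amount;
    pthread_mutex_unlock(&total_lock);
}

long read_total(void)
{
    long t;
    pthread_mutex_lock(&total_lock);
    t = shared_total;
    pthread_mutex_unlock(&total_lock);
    return t;
}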

Make sure that you actually understand how your mutexes work. There are a few different implementations, and they can behave differently.

Try to organize your code so that the critical sections (places where you hold some type of lock) are as quick as possible. Be aware that some functions may block (sleep, or wait on something, and keep the OS from allowing that thread to continue running for some time). Do not call these while holding any locks (unless absolutely necessary, or during debugging, as it can sometimes expose other bugs).
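
A sketch of that point (POSIX threads assumed; the names are made up): copy the shared data inside a short critical section, then do the slow, potentially blocking work after the lock is released.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static char shared_status[128];      /* protected by state_lock */

void report_status(FILE *out)
{
    char snapshot[128];

    pthread_mutex_lock(&state_lock);
    strncpy(snapshot, shared_status, sizeof snapshot);   /* short critical section */
    snapshot[sizeof snapshot - 1] = '\0';
    pthread_mutex_unlock(&state_lock);

    fprintf(out, "status: %s\n", snapshot);   /* blocking I/O done without the lock */
}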

Try to understand what more threads actually do for you. Blindly throwing more threads at a problem is very often going to make things worse. Different threads compete for the CPU and for locks.

Deadlock avoidance requires planning. Try to avoid having to acquire more than one lock at a time. If this is unavoidable, decide on an ordering that all threads will use to acquire and release the locks. Make sure you know what deadlock really means.
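
One common way to implement such an ordering (a sketch, assuming POSIX threads; account_t and the id-based rule are illustrative) is to derive the order from something both threads can compare, such as an id:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    int id;         /* unique id used to define the global lock order */
    long balance;
} account_t;

/* Every thread locks the lower-id account first, so two transfers
 * running in opposite directions can never deadlock on each other. */
void transfer(account_t *from, account_t *to, long amount)
{
    account_t *first  = (from->id < to->id) ? from : to;
    account_t *second = (from->id < to->id) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    from->balance -= amount;
    to->balance   += amount;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}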

Debugging multi-threaded or distributed applications is difficult. If you can do most of the debugging in a single-threaded environment (maybe even just forcing other threads to sleep), then you can try to eliminate non-threading-centric bugs before jumping into multi-threaded debugging.

Always think about what the other threads might be up to. Comment this in your code. If you are doing something a certain way because you know that at that time no other thread should be accessing a certain resource, write a big comment saying so.

You may want to wrap calls to mutex locks/unlocks in other functions like:

void my_lock_get(lock_type lock, const char *file, unsigned line, const char *msg)
{
    thread_id_type me = this_thread();

    /* time, thread name/id, call site, state, caller's message */
    logf("%u\t%s (%u)\t%s:%u\t%s\t%s\n",
         time_now(), thread_name(me), me, file, line, "get", msg);

    lock_get(lock);

    logf("%u\t%s (%u)\t%s:%u\t%s\t%s\n",
         time_now(), thread_name(me), me, file, line, "in", msg);
}

And a similar version for unlock. Note, the functions and types used in this are all made up and not overly based on any one API.

Using something like this you can come back if there is an error and use a perl script or something like it to run queries on your logs to examine where things went wrong (matching up locks and unlocks, for instance).

Note that your print or logging functionality may need to have locks around it as well. Many libraries already have this built in, but not all do. These locks must not use the logging versions of the lock_[get|release] functions, or you'll get infinite recursion.
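
A sketch of how that can look (assuming POSIX threads; this is one possible body for the made-up logf above, not from the answer): the logger takes its own raw mutex directly, never through my_lock_get, so logging a lock never tries to log itself.

#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t logf_lock = PTHREAD_MUTEX_INITIALIZER;

void logf(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);

    pthread_mutex_lock(&logf_lock);   /* raw lock, NOT my_lock_get() */
    vfprintf(stderr, fmt, ap);
    pthread_mutex_unlock(&logf_lock);

    va_end(ap);
}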

nategoose
A: 
  1. Beware of global variables even if they are const, in particular in C++. Only PODs that are statically initialized C-style are good here. As soon as a run-time constructor comes into play, be extremely careful. AFAIR, the initialization order of variables with static linkage that live in different compilation units is undefined. Maybe C++ classes that initialize all their members properly and have an empty constructor body are OK nowadays, but I once had a bad experience with that, too.

    This is one of the reasons why, on the POSIX side, pthread_mutex_t is much easier to program with than sem_t: it has a static initializer, PTHREAD_MUTEX_INITIALIZER (see the sketch after this list).

  2. Keep critical sections as short as possible, for two reasons: it might be more efficient in the end, but more importantly it is easier to maintain and to debug.

    A critical section should never be longer than a screen, including the locking and unlocking that is needed to protect it, and including the comments and assertions that help the reader understand what is happening.

    Start implementing critical sections very rigidly, maybe with one global lock for them all, and relax the constraints afterwards.

  3. Logging is difficult if many threads start to write at the same time. If every thread does a reasonable amount of work, try to have each of them write to a file of its own, so that they don't interlock each other.

    But beware: logging changes the behavior of the code. This can be bad when bugs disappear, or beneficial when bugs appear that you otherwise wouldn't have noticed.

    To do a post-mortem analysis of such a mess, you have to have accurate timestamps on each line, so that all the files can be merged and give you a coherent view of the execution.
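
A minimal illustration of the static-initializer point from item 1 (POSIX threads; the names are made up): the mutex is usable before main() runs, so there is no run-time constructor ordering to worry about.

#include <pthread.h>

static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;  /* no init call needed */
static int config_value;        /* protected by config_lock */

void set_config(int v)
{
    pthread_mutex_lock(&config_lock);
    config_value = v;
    pthread_mutex_unlock(&config_lock);
}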

Jens Gustedt
+1  A: 

Add priority inversion to that list.

As another poster alluded to, log files are wonderful things. For deadlocks, using a LogLock instead of a Lock can help pinpoint when your entities stop working. That is, once you know you've got a deadlock, the log will tell you when and where locks were instantiated and released. This can be enormously helpful in tracking these things down.

I've found that race conditions seem to disappear when using an Actor model that follows the same message -> confirm -> confirm-received style. That said, YMMV.

wheaties
+2  A: 

Eric's list of four issues is pretty much spot on. But debugging these issues is tough.

For deadlock, I've always favored "leveled locks". Essentially you give each type of lock a level number, and then require that a thread acquire locks in increasing level order.

To do leveled locks, you can declare a structure like this:

typedef struct my_lock_struct {
   os_mutex actual_lock;
   int level;
   struct my_lock_struct *prev_lock_in_thread;
} my_lock_struct;

/* per-thread pointer to the most recently acquired lock */
static __tls my_lock_struct *last_lock_in_thread;

void my_lock_acquire(int level, my_lock_struct *lock) {
    /* a thread may only acquire locks with strictly increasing levels */
    if (last_lock_in_thread != NULL) assert(last_lock_in_thread->level < level);
    os_lock_acquire(lock->actual_lock);
    lock->level = level;
    lock->prev_lock_in_thread = last_lock_in_thread;
    last_lock_in_thread = lock;
}

What's cool about leveled locks is that the possibility of deadlock triggers an assertion. And with some extra magic with __func__ and __LINE__, you know exactly what badness your thread did.
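
The answer only shows the acquire side; a matching release might look like this (a sketch in the same made-up API, assuming an os_lock_release primitive and that locks are released in LIFO order):

void my_lock_release(my_lock_struct *lock)
{
    /* releasing out of LIFO order would break the level bookkeeping */
    assert(last_lock_in_thread == lock);

    last_lock_in_thread = lock->prev_lock_in_thread;
    lock->prev_lock_in_thread = NULL;
    os_lock_release(lock->actual_lock);
}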

For data races and lack of synchronization, the current situation is pretty poor. There are static tools that try to identify issues, but the false-positive rate is high.

The company I work for ( http://www.corensic.com ) has a new product called Jinx that actively looks for cases where race conditions can be exposed. This is done by using virtualization technology to control the interleaving of threads on the various CPUs and zooming in on communication between CPUs.

Check it out. You probably have a few more days to download the Beta for free.

Jinx is particularly good at finding bugs in lock-free data structures. It also does very well at finding other race conditions. What's cool is that there are no false positives. If your testing gets close to a race condition, Jinx helps the code go down the bad path. But if the bad path doesn't exist, you won't be given false warnings.

Dave Dunn