views:

88

answers:

4

I have a huge global array of structures. Some regions of the array are tied to individual threads and those threads can modify their regions of the array without having to use critical sections. But there is one special region of the array which all threads may have access to. The code that accesses these parts of the array needs to carefully use critical sections (each array element has its own critical section) to prevent any possibility of two threads writing to the structure simultaneously.

Now I have a mysterious bug I am trying to chase, it is occurring unpredictably and very infrequently. It seems that one of the structures is being filled with some incorrect number. One obvious explanation is that another thread has accidentally been allowed to set this number when it should be excluded from doing so.

Unfortunately it seems close to impossible to track this bug. The array element in which the bad data appears is different each time. What I would love to be able to do is set some kind of trap for the bug as follows: I would enter a critical section for array element N, then I know that no other thread should be able to touch the data, then (until I exit the critical section) set some kind of flag to a debugging tool saying "if any other thread attempts to change the data here please break and show me the offending patch of source code"... but I suspect no such tool exists... or does it? Or is there some completely different debugging methodology that I should be employing.

A: 

I know two ways to handle such errors:

1) Read the code again and again, looking for possible errors. I can think about two errors that can cause this: unsynchronized access or writing by incorrect memory address. Maybe you have more ideas.

2) Logging, logging an logging. Add lot of optional traces (OutputDebugString or log file), in every critical place, which contain enough information - indexes, variable values etc. It is a good idea to add this tracing with some #ifdef. Reproduce the bug and try to understand from the log, what happens.

Alex Farber
+2  A: 

How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.

class critical_section;

template < class T >
class element_wrapper
{
public:
    element_wrapper(const T& v) : val(v) {}
    element_wrapper() {}
    const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY 
        if(!cs->is_locked())
            _CrtDebugBreak();
#endif
        val = v;
        return *this;
    }
    operator T() { return val; }
    critical_section* cs;
private:
    T val;
};

As for critical section implementation:

class critical_section
{
public:
    critical_section() : locked(FALSE) {
        ::InitializeCriticalSection(&cs);
    }
    ~critical_section() {
        _ASSERT(!locked);
        ::DeleteCriticalSection(&cs);
    }
    void lock() {
        ::EnterCriticalSection(&cs);
        locked = TRUE;
    }
    void unlock() {
        locked = FALSE;
        ::LeaveCriticalSection(&cs);
    }
    BOOL is_locked() {
        return locked;
    }
private:
    CRITICAL_SECTION cs;
    BOOL locked;
};

Actually, instead of custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine if a critical section is owned. Though, the implementation above is almost as good.

So the appropriate usage would be:

typedef std::vector< element_wrapper<int> > cont_t;

void change(cont_t::reference x) { x.lock(); x = 1; x.unlock(); }

int main()
{
    cont_t container(10, 0); 

    std::for_each(container.begin(), container.end(), &change);
}
Janusz Lenar
I haven't learned about templates yet so can't really judge... but if your answer gets up-votes then I will investigate.
Mick
If you don't want to use templates, simply replace `T` with a type of data stored here. Actually, what I meant to communicate is just a concept of wrapping simple data with overriden setter function.
Janusz Lenar
@malleor: my instinct tell me this is the way to go...
Mick
@malleor: I had assumed that "islocked()" was some standard function that everyone should know about... but I can't find it anywhere :-( how do I tell if a critical section is currently locked?
Mick
@Mick: No, no.. My snippet is just a draft. I'll add a `critical_section` code right now.
Janusz Lenar
@malleor: reps heading your way.
Mick
A: 

Your best (fastest) bet is still to revise the mutex code. As you said, it is the obvious explanation - why not trying to really find the explanation (by logic) instead of additional hints (by coding) that may come out inconclusive? If the code review doesn't turn out something useful you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex - implement threads (start from 2 upwards) that all try to access the same data structure again and again with a random small delay in each of them to have them jitter around on the time line. If this test results in a buggy mutex which you simply can't identify in the code then you have fallen victim to some architecture dependant effect (maybe intstruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If OTOH you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error should appear much more often) so that you can ensure that it really is the cause of your original problem.

slartibartfast
A: 

I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.

There are quite a few limitations to this that prevent it from working. Most importantly is the fact that I think you can only set privileges on a page by page basis (4K I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread actively owned the critical section. But it would catch it and cause an immediate access violation if the critical section was not owned.

The idea would be to do to change your EnterCriticalSection calls to:

EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );

And change the LeaveCriticalSection calls to:

VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()

The following chunk of code shows a call to VirtualProtect

int main( int argc, char* argv[] 1
{
  unsigned char *mem;
  int i;
  DWORD dwOld;

  // this assume 4K page size
  mem = malloc( 4096 * 10 );
  for ( i = 0; i < 10; i++ )
     mem[i * 4096] = i;
  // set the second page to be readonly.  The allocation from malloc is
  // not necessarily on a page boundary, but this will definitely be in
  // the second page.
  printf( "VirtualProtect res = %d\n",
          VirtualProtect( mem + 4096,
                          1, // ends up setting entire page
                          PAGE_READONLY, &dwOld ));
  // can still read it
  for ( i = 1; i < 10; i++ )
     printf( "%d ", mem[i*4096] );
  printf( "\n" );

  // Can write to all but the second page
  for ( i = 0; i < 10; i++ )
     if ( i != 1 ) // avoid second page which we made readonly
        mem[i] = 1;
  // this causes an access violation
  mem[4096] = 1;
}
Mark Wilkins
Hmmmm... not very practical... perhaps you should tell me what you come up with while pedaling home :-)
Mick
@Mick: I'll work on that :). It's an interesting problem, though, and I have had to debug a few like this. The reason I spent the time studying it is because I think I can use it in the future. I always use my own allocation function (even if it is a simple #define malloc mymalloc type of thing), so it is a simple matter to control page allocations to work with this.
Mark Wilkins
@Mark: What do you think of malleor's answer then?
Mick
@Mick: If you can wrap all of the assignments, then it may work well. A potential drawback is that if the "random" value that gets into the memory is not placed there via an assignment, then it won't catch it.
Mark Wilkins