ansaurus

Question

Debugging instance of another thread altering my data

Answer 1

A:

I know two ways to handle such errors:

1) Read the code again and again, looking for possible errors. I can think about two errors that can cause this: unsynchronized access or writing by incorrect memory address. Maybe you have more ideas.

2) Logging, logging an logging. Add lot of optional traces (OutputDebugString or log file), in every critical place, which contain enough information - indexes, variable values etc. It is a good idea to add this tracing with some #ifdef. Reproduce the bug and try to understand from the log, what happens.

Alex Farber 2010-03-17 12:10:40

Answer 2

+2 A:

How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.

class critical_section;

template < class T >
class element_wrapper
{
public:
    element_wrapper(const T& v) : val(v) {}
    element_wrapper() {}
    const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY 
        if(!cs->is_locked())
            _CrtDebugBreak();
#endif
        val = v;
        return *this;
    }
    operator T() { return val; }
    critical_section* cs;
private:
    T val;
};

As for critical section implementation:

class critical_section
{
public:
    critical_section() : locked(FALSE) {
        ::InitializeCriticalSection(&cs);
    }
    ~critical_section() {
        _ASSERT(!locked);
        ::DeleteCriticalSection(&cs);
    }
    void lock() {
        ::EnterCriticalSection(&cs);
        locked = TRUE;
    }
    void unlock() {
        locked = FALSE;
        ::LeaveCriticalSection(&cs);
    }
    BOOL is_locked() {
        return locked;
    }
private:
    CRITICAL_SECTION cs;
    BOOL locked;
};

Actually, instead of custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine if a critical section is owned. Though, the implementation above is almost as good.

So the appropriate usage would be:

typedef std::vector< element_wrapper<int> > cont_t;

void change(cont_t::reference x) { x.lock(); x = 1; x.unlock(); }

int main()
{
    cont_t container(10, 0); 

    std::for_each(container.begin(), container.end(), &change);
}

Janusz Lenar 2010-03-17 12:25:50

I haven't learned about templates yet so can't really judge... but if your answer gets up-votes then I will investigate.

Mick 2010-03-17 12:51:19

If you don't want to use templates, simply replace `T` with a type of data stored here. Actually, what I meant to communicate is just a concept of wrapping simple data with overriden setter function.

Janusz Lenar 2010-03-17 13:51:49

@malleor: my instinct tell me this is the way to go...

Mick 2010-03-17 15:17:02

@malleor: I had assumed that "islocked()" was some standard function that everyone should know about... but I can't find it anywhere :-( how do I tell if a critical section is currently locked?

Mick 2010-03-17 17:49:21

@Mick: No, no.. My snippet is just a draft. I'll add a `critical_section` code right now.

Janusz Lenar 2010-03-17 18:00:49

@malleor: reps heading your way.

Mick 2010-03-17 18:56:20

Answer 3

A:

Your best (fastest) bet is still to revise the mutex code. As you said, it is the obvious explanation - why not trying to really find the explanation (by logic) instead of additional hints (by coding) that may come out inconclusive? If the code review doesn't turn out something useful you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex - implement threads (start from 2 upwards) that all try to access the same data structure again and again with a random small delay in each of them to have them jitter around on the time line. If this test results in a buggy mutex which you simply can't identify in the code then you have fallen victim to some architecture dependant effect (maybe intstruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If OTOH you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error should appear much more often) so that you can ensure that it really is the cause of your original problem.

slartibartfast 2010-03-17 14:54:07

Answer 4

A:

I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.

There are quite a few limitations to this that prevent it from working. Most importantly is the fact that I think you can only set privileges on a page by page basis (4K I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread actively owned the critical section. But it would catch it and cause an immediate access violation if the critical section was not owned.

The idea would be to do to change your EnterCriticalSection calls to:

EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );

And change the LeaveCriticalSection calls to:

VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()

The following chunk of code shows a call to VirtualProtect

int main( int argc, char* argv[] 1
{
  unsigned char *mem;
  int i;
  DWORD dwOld;

  // this assume 4K page size
  mem = malloc( 4096 * 10 );
  for ( i = 0; i < 10; i++ )
     mem[i * 4096] = i;
  // set the second page to be readonly.  The allocation from malloc is
  // not necessarily on a page boundary, but this will definitely be in
  // the second page.
  printf( "VirtualProtect res = %d\n",
          VirtualProtect( mem + 4096,
                          1, // ends up setting entire page
                          PAGE_READONLY, &dwOld ));
  // can still read it
  for ( i = 1; i < 10; i++ )
     printf( "%d ", mem[i*4096] );
  printf( "\n" );

  // Can write to all but the second page
  for ( i = 0; i < 10; i++ )
     if ( i != 1 ) // avoid second page which we made readonly
        mem[i] = 1;
  // this causes an access violation
  mem[4096] = 1;
}

Mark Wilkins 2010-03-17 15:09:34

Hmmmm... not very practical... perhaps you should tell me what you come up with while pedaling home :-)

Mick 2010-03-17 15:13:57

@Mick: I'll work on that :). It's an interesting problem, though, and I have had to debug a few like this. The reason I spent the time studying it is because I think I can use it in the future. I always use my own allocation function (even if it is a simple #define malloc mymalloc type of thing), so it is a simple matter to control page allocations to work with this.

Mark Wilkins 2010-03-17 15:23:54

@Mark: What do you think of malleor's answer then?

Mick 2010-03-17 15:43:27

@Mick: If you can wrap all of the assignments, then it may work well. A potential drawback is that if the "random" value that gets into the memory is not placed there via an assignment, then it won't catch it.

Mark Wilkins 2010-03-17 15:53:06

ansaurus

tags:

views:

answers:

Debugging instance of another thread altering my data

related questions