views:

230

answers:

1

I have lockless queues written in C in form of a linked list that contains requests from several threads posted to and handled in a single thread. After a few hours of stress I end up having the last request's next pointer pointing to itself, which creates an endless loop and locks up the handling thread.

The application runs (and fails) on both Linux and Windows. I'm debugging on Windows, where my COMPARE_EXCHANGE_PTR maps to InterlockedCompareExchangePointer.

This is the code that pushes a request to the head of the list, and is called from several threads:

void push_request(struct request * volatile * root, struct request * request)
{
    assert(request);

    do {
        request->next = *root;
    } while(COMPARE_EXCHANGE_PTR(root, request, request->next) != request->next);
}

This is the code that gets a request from the end of the list, and is only called by a single thread that is handling them:

struct request * pop_request(struct request * volatile * root)
{
    struct request * volatile * p;
    struct request * request;

    do {
        p = root;
        while(*p && (*p)->next) p = &(*p)->next; // <- loops here
        request = *p;
    } while(COMPARE_EXCHANGE_PTR(p, NULL, request) != request);

     assert(request->next == NULL);

     return request;
}

Note that I'm not using a tail pointer because I wanted to avoid the complication of having to deal with the tail pointer in push_request. However I suspect that the problem might be in the way I find the end of the list.

There are several places that push a request into the queue, but they all look generaly like this:

// device->requests is defined as struct request * volatile requests;
struct request * request = malloc(sizeof(struct request));
if(request) {
    // fill out request fields
    push_request(&device->requests, request);
    sem_post(device->request_sem);
}

The code that handles the request is doing more than that, but in essence does this in a loop:

if(sem_wait_timeout(device->request_sem, timeout) == sem_success) {
    struct request * request = pop_request(&device->requests);
    // handle request
    free(request);
}

I also just added a function that is checking the list for duplicates before and after each operation, but I'm afraid that this check will change the timing so that I will never encounter the point where it fails. (I'm waiting for it to break as I'm writing this.)

When I break the hanging program the handler thread loops in pop_request at the marked position. I have a valid list of one or more requests and the last one's next pointer points to itself. The request queues are usually short, I've never seen more then 10, and only 1 and 3 the two times I could take a look at this failure in the debugger.

I thought this through as much as I could and I came to the conclusion that I should never be able to end up with a loop in my list unless I push the same request twice. I'm quite sure that this never happens. I'm also fairly sure (although not completely) that it's not the ABA problem.

I know that I might pop more than one request at the same time, but I believe this is irrelevant here, and I've never seen it happening. (I'll fix this as well)

I thought long and hard about how I can break my function, but I don't see a way to end up with a loop.

So the question is: Can someone see a way how this can break? Can someone prove that this can not?

Eventually I will solve this (maybe by using a tail pointer or some other solution - locking would be a problem because the threads that post should not be locked, I do have a RW lock at hand though) but I would like to make sure that changing the list actually solves my problem (as opposed to makes it just less likely because of different timing).

+3  A: 

It's subtle but you do have a race condition there.

Start with a list with one element in it, req1. So we have:

device->requests == req1;
req1->next == NULL;

Now, we push a new element req2, and simultaneously try to pop the queue. The push for req2 starts first. The while loop body runs, so we now have:

device->requests == req1;
req1->next == NULL;
req2->next == req1;

Then the COMPARE_EXCHANGE_PTR runs, so we have:

device->requests == req2;
req1->next == NULL;
req2->next == req1;

...and the COMPARE_EXCHANGE_PTR() returns req1. Now, at this point, before the comparison in the while condition, the push gets interrupted and the pop starts running.

The pop runs correctly to completion, popping off req1 - which means that we have:

device->requests == req2;
req2->next == NULL;

The push restarts. It now fetches request->next to do the comparison - and it fetches the new value of req2->next, which is NULL. It compares req1 with NULL, the comparison succeeds, the while loop runs again, and now we have:

device->requests == req2;
req2->next == req2;

This time the test fails, the while loop exits, and you have your loop.


This should fix it:

void push_request(struct request * volatile * root, struct request * request)
{
    struct request *oldroot;

    assert(request);

    do {
        request->next = oldroot = *root;
    } while(COMPARE_EXCHANGE_PTR(root, request, oldroot) != oldroot);
}
caf
Yes, this is what's going on, I see it now - thanks!
Fozi
@caf, So after reading that three times and turning off music so I could concentrate, the underlying issue is that there is an unguarded write to `request->next` inside `pop_request` that violates the assumption that only `push_request` modifies `request->next`. Right?
MSN
@MSN: Yes, although the write is guarded by the atomic compare exchange - the problem is that it's read *twice* in `push_request`, and only one of the reads is guarded.
caf