views:

931

answers:

2

I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I have created the test program to show the problem:

#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]),0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) {zeros[i] = 0u;}
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}

This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:

fcudaSafeCall() Runtime API error in file , line XXX : the launch timed out and was terminated.

which means that one of those while loops is not returning, but it seems like this should work.

As a reminder atomicExch(unsigned int* address, unsigned int val) atomically sets the value of the memory location stored in address to val and returns the old value. So the idea behind my locking mechanism is that it is initially 0u, so one thread should get past the while loop and all other threads should wait on the while loop since they will read locks[id] as 1u. Then when the thread is done with the critical section it resets the lock to 0u so another thread can enter.

What am I missing?

By the way, I am compiling with:

nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu
+2  A: 

Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.

As any good cuda programmer knows (notice that I did not remember this which makes me a bad cuda programmer, I think) all threads in a warp must execute the same code. The code I wrote would work perfectly if not for this fact. As it is, however, there are likely to be two threads in the same warp accessing the same lock. If one of them acquires the lock, it just forgets about executing the loop, but it cannot continue past the loop until all other threads in its warp have completed the loop. Unfortunately the other thread will never complete because it is waiting for the first one to unlock.

Here is a kernel that will do the trick without error:

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]),0u);
        }
    } 
}
quadelirus
This has been discussed several times on the NVIDIA forums. I think the conclusion is that this only works if you can ensure that the number of blocks is less than or equal to the number of multiprocessors. If not, it can lead to deadlock. In other words, try to find another way of implementing your algorithm that doesn't require critical sections.
Eric
A: 

by the way u have to remember that global memory writes and ! reads aren't completed where u write them in the code ... so for this to be practice you need to add a global memfence ie __threadfence()

eri