I'm working on a C# library which offloads certain work tasks to the GPU using NVIDIA's CUDA. An example of this is adding two arrays together using extension methods:

float[] a = new float[]{ ... };
float[] b = new float[]{ ... };
float[] c = a.Add(b);

The work in this code is done on the GPU. However, I would like it to run asynchronously, so that the code running on the CPU blocks only when the result is actually needed (and only if the GPU has not finished by then). To do this I've created an ExecutionResult class which hides the asynchronous execution. In use it looks as follows:

float[] a = new float[]{ ... };
float[] b = new float[]{ ... };
ExecutionResult<float> res = a.Add(b);
float[] c = res; // Implicit conversion

At the last line the program blocks if the data is not ready yet. I'm not certain of the best way to implement this blocking behavior inside the ExecutionResult class, as I'm not very experienced with thread synchronization.

using System;
using System.Threading;

public class ExecutionResult<T>
{
    private T[] result;
    private long computed = 0; // 0 = pending, 1 = result available

    internal ExecutionResult(T[] a, T[] b, Action<T[], T[], Action<T[]>> f)
    {
        f(a, b, UpdateData); // Async call - 'UpdateData' is the callback method
    }

    internal void UpdateData(T[] data)
    {
        if (Interlocked.Read(ref computed) == 0)
        {
            result = data;
            Interlocked.Exchange(ref computed, 1);
        }
    }

    public static implicit operator T[](ExecutionResult<T> r)
    {
        // This is obviously a stupid way to do it: poll until the flag is set
        while (Interlocked.Read(ref r.computed) == 0)
        {
            Thread.Sleep(1);
        }

        return r.result; // static operator, so the field is read through 'r'
    }
}

The Action passed to the constructor is an asynchronous method which performs the actual work on the GPU. The nested Action is the asynchronous callback method.

My main concern is how to handle the waiting in the converter most elegantly, but I would also like to know whether there are more appropriate ways to attack the problem as a whole. Just leave a comment if there is something I need to elaborate on or explain further.

+1  A: 

I wonder if you couldn't use the regular Delegate.BeginInvoke/Delegate.EndInvoke here? If not, then a wait handle (such as a ManualResetEvent) might be an option:

using System.Threading;
static class Program {
    static void Main()
    {
        ThreadPool.QueueUserWorkItem(DoWork);

        System.Console.WriteLine("Main: waiting");
        wait.WaitOne();
        System.Console.WriteLine("Main: done");
    }
    static void DoWork(object state)
    {
        System.Console.WriteLine("DoWork: working");
        Thread.Sleep(5000); // simulate work
        System.Console.WriteLine("DoWork: done");
        wait.Set();
    }
    static readonly ManualResetEvent wait = new ManualResetEvent(false);

}

Note that you can do this with just a plain object and Monitor if you really want:

using System.Threading;
static class Program {
    static void Main()
    {
        object syncObj = new object();
        lock (syncObj)
        {
            ThreadPool.QueueUserWorkItem(DoWork, syncObj);

            System.Console.WriteLine("Main: waiting");
            Monitor.Wait(syncObj);
            System.Console.WriteLine("Main: done");
        }
    }
    static void DoWork(object syncObj)
    {

        System.Console.WriteLine("DoWork: working");
        Thread.Sleep(5000); // simulate work
        System.Console.WriteLine("DoWork: done");
        lock (syncObj)
        {
            Monitor.Pulse(syncObj);
        }
    }

}
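
Applied to the ExecutionResult class from the question, the wait-handle approach might look roughly like the sketch below. The class shape and constructor signature are taken from the question; AddAsync is a hypothetical stand-in that does the addition on a thread-pool thread instead of the GPU, and the Demo class exists only to make the example runnable.

```csharp
using System;
using System.Threading;

public class ExecutionResult<T>
{
    private T[] result;
    // Signaled once the callback has stored the result.
    private readonly ManualResetEvent done = new ManualResetEvent(false);

    internal ExecutionResult(T[] a, T[] b, Action<T[], T[], Action<T[]>> f)
    {
        f(a, b, UpdateData); // starts the async work; UpdateData is the callback
    }

    private void UpdateData(T[] data)
    {
        result = data;
        done.Set(); // releases any thread blocked in the conversion below
    }

    public static implicit operator T[](ExecutionResult<T> r)
    {
        r.done.WaitOne(); // blocks without polling until Set() has been called
        return r.result;
    }
}

static class Demo
{
    // Hypothetical stand-in for the GPU call (assumption, not CUDA code).
    static void AddAsync(float[] a, float[] b, Action<float[]> callback)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            var c = new float[a.Length];
            for (int i = 0; i < a.Length; i++) c[i] = a[i] + b[i];
            callback(c);
        });
    }

    static void Main()
    {
        var res = new ExecutionResult<float>(
            new[] { 1f, 2f }, new[] { 3f, 4f }, AddAsync);
        float[] c = res; // blocks here only if the work has not finished
        Console.WriteLine(string.Join(", ", c));
    }
}
```

Because the event stays signaled after Set(), converting the same ExecutionResult more than once returns immediately on later calls.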
Marc Gravell
+6  A: 

It's not clear to me how much this is a framework you're implementing and how much you're calling into other code, but I would follow the "normal" async pattern in .NET as far as possible.
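
The answer doesn't say which pattern it has in mind. Assuming the Task-based pattern that current .NET uses, a callback-style GPU call like the one in the question could be wrapped with TaskCompletionSource, as in this sketch. AddOnDevice is a hypothetical stand-in for the real GPU call, and the AddAsync extension name is likewise an assumption.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ArrayExtensions
{
    // Hypothetical stand-in for the callback-based GPU call in the question.
    static void AddOnDevice(float[] a, float[] b, Action<float[]> callback)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            var c = new float[a.Length];
            for (int i = 0; i < a.Length; i++) c[i] = a[i] + b[i];
            callback(c);
        });
    }

    // Wraps the callback in a Task so callers can await it or block on Result.
    public static Task<float[]> AddAsync(this float[] a, float[] b)
    {
        var tcs = new TaskCompletionSource<float[]>();
        AddOnDevice(a, b, tcs.SetResult); // the callback completes the task
        return tcs.Task;
    }
}

static class Demo
{
    static void Main()
    {
        Task<float[]> pending = new[] { 1f, 2f }.AddAsync(new[] { 3f, 4f });
        // Other CPU work can run here while the addition is in flight.
        float[] c = pending.Result; // blocks only if the task is not complete yet
        Console.WriteLine(string.Join(", ", c));
    }
}
```

Returning a Task instead of a custom result type means callers get waiting, continuations and error propagation from the framework, at the cost of losing the implicit conversion to an array.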

Jon Skeet
+3  A: 

The solution I found to the problem is to pass a function to the ExecutionResult constructor which does two things. When run, it starts the asynchronous work and in addition, it returns another function which returns the desired result:

private Func<T[]> getResult;

internal ExecutionResult(T[] a, T[] b, Func<T[], T[], Func<T[]>> asynchBinaryFunction)
{
    // Starts the asynchronous work and keeps the function that fetches the result
    getResult = asynchBinaryFunction(a, b);
}

public static implicit operator T[](ExecutionResult<T> r)
{
    return r.getResult();
}

The 'getResult' function blocks until the data has been calculated and fetched from the GPU. This works well with how the CUDA driver API is structured.

It is quite a clean and simple solution. Since C# anonymous functions can capture variables from the enclosing scope, it is simply a matter of wrapping the blocking part of the method passed to the ExecutionResult constructor in a delegate, such that...

    ...

    status = LaunchGrid(func, length);

    //Fetch result
    float[] c = new float[length];
    status = CUDADriver.cuMemcpyDtoH(c, ptrA, byteSize);
    status = Free(ptrA, ptrB);

    return c;
}

becomes...

    ...

    status = LaunchGrid(func, length);

    return delegate
    {
        float[] c = new float[length];
        CUDADriver.cuMemcpyDtoH(c, ptrA, byteSize); //Blocks until work is done
        Free(ptrA, ptrB);
        return c;
    };
}
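
For anyone who wants to try the pattern without a GPU, here is a minimal runnable sketch. The ExecutionResult part follows the code above; the Add method is a hypothetical CPU stand-in in which a worker thread plays the role of the kernel launch and Thread.Join plays the role of the blocking cuMemcpyDtoH.

```csharp
using System;
using System.Threading;

public class ExecutionResult<T>
{
    private readonly Func<T[]> getResult;

    internal ExecutionResult(T[] a, T[] b, Func<T[], T[], Func<T[]>> asynchBinaryFunction)
    {
        // Starts the work and keeps the function that will fetch the result.
        getResult = asynchBinaryFunction(a, b);
    }

    public static implicit operator T[](ExecutionResult<T> r)
    {
        return r.getResult(); // blocks until the result is available
    }
}

static class Demo
{
    // Hypothetical CPU stand-in for the launch-then-copy sequence (assumption).
    static Func<float[]> Add(float[] a, float[] b)
    {
        var c = new float[a.Length];
        var worker = new Thread(() =>
        {
            for (int i = 0; i < a.Length; i++) c[i] = a[i] + b[i];
        });
        worker.Start(); // plays the role of LaunchGrid: returns immediately

        // The returned delegate captures 'worker' and 'c' from this scope.
        return delegate
        {
            worker.Join(); // plays the role of the blocking memcpy
            return c;
        };
    }

    static void Main()
    {
        var res = new ExecutionResult<float>(
            new[] { 1f, 2f }, new[] { 3f, 4f }, Add);
        float[] c = res; // blocks here only if the worker is still running
        Console.WriteLine(string.Join(", ", c));
    }
}
```

One thing to watch in the real version: the delegate shown above copies from the device and calls Free every time it runs, so converting the same ExecutionResult twice would repeat both steps. Caching the array after the first call would probably be the simplest way around that.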
Morten Christiansen
A: 
Danny Varod