ansaurus

Question

CUDA global (as in C) dynamic arrays allocated to device memory

Answer 1

A:

Spend some time focusing on the copious documentation offered by NVIDIA.

From the Programming Guide:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}

So now you can invoke them like so:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
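To check that the two launches above actually did something, a minimal sketch along these lines copies the buffer back to the host and prints two elements. The CHECK macro and the host array are my own additions for illustration, not something taken from the Programming Guide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort on any runtime API error.
#define CHECK(call) do {                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

__global__ void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__ void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}

int main()
{
    const int n = 256;
    float *devPtr = NULL;
    CHECK(cudaMalloc((void**)&devPtr, n * sizeof(*devPtr)));
    CHECK(cudaMemset(devPtr, 0, n * sizeof(*devPtr)));

    // Both launches go to the default stream, so kernel2 starts only
    // after kernel1 has finished.
    kernel1<<<1, 128>>>(devPtr);
    kernel2<<<1, 128>>>(devPtr);
    CHECK(cudaGetLastError());  // reports launch failures

    // A blocking cudaMemcpy on the default stream waits for both kernels.
    float host[n];
    CHECK(cudaMemcpy(host, devPtr, n * sizeof(*devPtr), cudaMemcpyDeviceToHost));

    // Only the first 128 elements were touched: (0 + 1) * 0.3.
    // Elements 128..255 keep the zero written by cudaMemset.
    printf("host[0] = %f, host[200] = %f\n", host[0], host[200]);

    CHECK(cudaFree(devPtr));
    return 0;
}
```

Kernel launches do not return an error code themselves, which is why the sketch asks cudaGetLastError() right after them.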

You wrote: "As this data is used in numerous functions, I would like it to be global."

There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.

EDIT:

OK, the fundamental problem is this: your kernels can only access device memory, and the only global-scope pointers they can use are GPU ones. When you launch a kernel from the CPU, the arguments (pointers and primitives) are copied to the device behind the scenes before the kernel executes: into shared memory on compute capability 1.x hardware, and into constant memory on later hardware.

So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But keep in the back of your mind that a different approach might be the Right Thing.

#include <algorithm>

__constant__ float devPtr[1024];

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}


int main(int argc, char *argv[])
{
    float some_data[256];
    for (size_t i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)  // size_t avoids a signed/unsigned comparison
    {
        some_data[i] = i * 2;
    }
    cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr) ));
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    kernel2<<<1,128>>>(otherDevPtr);

    cudaFree(otherDevPtr);  // release the device buffer before exiting

    return 0;
}

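Don't forget '--host-compilation=c++' for this example; std::min needs the host code compiled as C++. (Later nvcc releases compile host code as C++ by default.)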

2008-09-17 02:24:59

Yeah - that was my solution originally. Only, not in constant memory, because the array is rather too large :< So what is the verdict on __constant__ float* devPtr; (or in my case __device__ float* devPtr;)? I suspect that there is a very good reason why you can't have a global pointer to device data.

Voltaire 2008-09-17 03:21:55

Also - I didn't see your edit. However, I'm still not sure why a *pointer* to device memory is invalid while an array is OK.

Voltaire 2008-09-17 03:28:08

Answer 2

A:

Erm, it was exactly that problem of moving devPtr to global scope that was my problem.

I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.

I have read the documentation fairly closely, and hit up the NVIDIA forums (and searched Google for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).

Voltaire 2008-09-17 02:35:32

Answer 3

A:

Something like this should probably work.

#include <cstdio>   // fprintf, stderr
#include <cstdlib>  // exit, EXIT_FAILURE

#define NDEBUG
#define CUT_CHECK_ERROR(errorMessage) do {                                 \
        cudaDeviceSynchronize();                                           \
         cudaError_t err = cudaGetLastError();                             \
         if( cudaSuccess != err) {                                         \
                     fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \
                                             errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\
                     exit(EXIT_FAILURE);                                                  \
                 } } while (0)


__device__ float *devPtr;

__global__
void kernel1(float *some_neat_data)
{
    devPtr = some_neat_data;
}

__global__
void kernel2(void)
{
    devPtr[threadIdx.x] *= .3f;
}


int main(int argc, char *argv[])
{
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    CUT_CHECK_ERROR("kernel1");

    kernel2<<<1,128>>>();

    CUT_CHECK_ERROR("kernel2");

    cudaFree(otherDevPtr);  // devPtr still holds this address, so don't use it after this point

    return 0;
}

Give it a whirl.
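
If you would rather not spend a kernel launch on setting the pointer, a sketch along the following lines should also work: it copies the pointer value itself into the __device__ symbol from the host with cudaMemcpyToSymbol(). The kernel name "scale" is made up for this illustration, and I'm assuming a toolkit recent enough to have cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global-scope pointer that lives on the device; it is filled in from the host.
__device__ float *devPtr;

// Hypothetical kernel: reads the global pointer instead of taking an argument.
__global__ void scale(void)
{
    devPtr[threadIdx.x] *= 0.3f;
}

int main()
{
    float *otherDevPtr = NULL;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    // Copy the address held in otherDevPtr (not the data it points to)
    // into the device-side variable devPtr.
    cudaError_t err = cudaMemcpyToSymbol(devPtr, &otherDevPtr, sizeof(otherDevPtr));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpyToSymbol: %s\n", cudaGetErrorString(err));
        return 1;
    }

    scale<<<1, 128>>>();

    // Wait for the kernel and pick up any launch or execution error.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "scale: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(otherDevPtr);
    return 0;
}
```

The host cannot assign to devPtr directly, because the variable lives in device memory; cudaMalloc((void**)&devPtr, ...) fails for the same reason, which is why the allocation goes through an ordinary host-side pointer first.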

2008-09-17 03:52:41

That's interesting. I see how cudaMalloc isn't being called directly on devPtr; instead, devPtr is set in that first kernel call. I will give that a try later and let you know if it works out, thanks very much :D

Voltaire 2008-09-17 04:03:35

Answer 4

A:

Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.

Mark Borgerding 2008-09-18 02:27:59

Answer 5

A:

I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.

The good news is that it does work :)

However, I think it confuses the compiler as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.

Anyway, for the record - I have looked at many of the examples and did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example that does dynamic global device memory allocation, I would still like to know.

Voltaire 2008-09-18 16:55:53

Answer 6

A:

Hi,

Can you share the code? I am trying the same thing, but I am having problems when working with large data.

thanks,

shahed

2009-07-07 20:04:16

Answer 7

A:

The question says: "As this data is used in numerous functions, I would like it to be global."

Answer 1 replies: "There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving 'devPtr' to a global scope."

What if the kernel operates on a large const structure consisting of arrays? Using the so-called constant memory is not an option, because it's very limited in size, so then you have to put it in global memory?

Nils 2010-08-18 11:21:28
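
For the large read-only case raised above, one possible arrangement is to keep the arrays in ordinary global memory and hand the kernel a small struct of const pointers by value. This is only a sketch under my own assumptions: the struct "Params", its fields, and the kernel "apply" are hypothetical names, not anything from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical read-only bundle. The struct itself is tiny and is passed
// by value; the arrays it points to live in global memory.
struct Params {
    const float *weights;
    const float *offsets;
    int n;
};

__global__ void apply(Params p, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        out[i] = p.weights[i] * 2.0f + p.offsets[i];
}

int main()
{
    const int n = 1 << 20;                   // 4 MB per array, far beyond 64 KB of constant memory
    const size_t bytes = n * sizeof(float);

    float *weights = NULL, *offsets = NULL, *out = NULL;
    cudaMalloc((void**)&weights, bytes);
    cudaMalloc((void**)&offsets, bytes);
    cudaMalloc((void**)&out, bytes);
    cudaMemset(weights, 0, bytes);           // placeholder contents for the sketch
    cudaMemset(offsets, 0, bytes);

    Params p = { weights, offsets, n };
    apply<<<(n + 255) / 256, 256>>>(p, out);

    // Wait for the kernel and report the final status.
    cudaError_t err = cudaDeviceSynchronize();
    printf("apply: %s\n", cudaGetErrorString(err));

    cudaFree(weights);
    cudaFree(offsets);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Passing the struct as a kernel argument keeps every function signature down to one parameter without needing a global, and the const qualifiers document that the kernel only reads the arrays.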
