So, I'm trying to write some code that utilizes NVIDIA's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device.

As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.

So, I have device functions that want to access a device-allocated array.

Ideally, I could do something like:

__device__ float* global_data;

main()
{
  cudaMalloc((void**)&global_data, some_size);
  kernel1<<<blah>>>(blah); //access global data
  kernel2<<<blah>>>(blah); //access global data again
}

However, I haven't figured out how to create a dynamic array. I figured out a workaround by declaring the array as follows:

__device__ float global_data[REALLY_LARGE_NUMBER];

And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.
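Spelled out, that workaround looks something like this (a sketch; the size 4096 and the kernel name are just stand-ins for my actual code):

__device__ float global_data[4096]; // stand-in for REALLY_LARGE_NUMBER

__global__ void scale(float factor)
{
    // Device code can reference the __device__ array directly.
    global_data[threadIdx.x] *= factor;
}

int main()
{
    float host_data[4096] = {0};
    // The host can only touch the symbol through the runtime API.
    cudaMemcpyToSymbol(global_data, host_data, sizeof(host_data));
    scale<<<1,256>>>(0.5f);
    return 0;
}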

A: 

Spend some time focusing on the copious documentation offered by NVIDIA.

From the Programming Guide:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}

So now you can invoke them like so:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);

As this data is used in numerous functions, I would like it to be global.

There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
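For completeness, a sketch of that exercise: "devPtr" simply becomes a file-scope host variable that still holds a device address and still gets passed to the two kernels defined above (the setup() helper here is illustrative):

float* devPtr; // file-scope *host* pointer that holds a device address

void setup(void)
{
    cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
    cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
}

int main()
{
    setup();
    kernel1<<<1,128>>>(devPtr); // still passed in, just declared at global scope
    kernel2<<<1,128>>>(devPtr);
    return 0;
}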

EDIT:

Ok, the fundamental problem is this: your kernels can only access device memory, and the only global-scope pointers they can use are GPU ones. When you call a kernel from the CPU, behind the scenes the pointer and primitive arguments get copied into GPU registers and/or shared memory before the kernel executes.

So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the back of your mind, consider that a different approach might be the Right Thing.

#include <algorithm>

__constant__ float devPtr[1024];

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}


int main(int argc, char *argv[])
{
    float some_data[256];
    for (size_t i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
    {
        some_data[i] = 2.0f * i;
    }
    // Copy at most sizeof(devPtr) bytes into the constant-memory array.
    cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr)));
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    kernel2<<<1,128>>>(otherDevPtr);

    return 0;
}

Don't forget '--host-compilation=c++' for this example.
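If the array is too large for constant memory, a variant of the same idea should work: keep the big allocation in global memory and use cudaMemcpyToSymbol() to copy only the pointer value into a __device__ symbol. A sketch (names are illustrative, and I haven't benchmarked it):

__device__ float* big_data; // device-side global pointer

__global__
void kernel3(void)
{
    big_data[threadIdx.x] += 1.0f;
}

int main(int argc, char *argv[])
{
    float* d_buf;
    cudaMalloc((void**)&d_buf, 1000000 * sizeof(*d_buf));
    cudaMemset(d_buf, 0, 1000000 * sizeof(*d_buf));

    // Copy the pointer value itself into the __device__ symbol.
    cudaMemcpyToSymbol(big_data, &d_buf, sizeof(d_buf));

    kernel3<<<1,128>>>();
    return 0;
}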

Yeah - that was my solution originally, only not in constant memory, because the array is rather too large. :< So what is the verdict on __constant__ float* devPtr; (or, in my case, __device__ float* devPtr;)? I suspect that there is a very good reason why you can't have a global pointer to device data.
Voltaire
Also - didn't see your edit. However, I'm still not sure why a *pointer* to device memory is invalid while an array is OK.
Voltaire
A: 

Erm, moving devPtr to global scope was exactly my problem.

I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.

I have read the documentation fairly closely, and hit up the NVIDIA forums (and Googled for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).

Voltaire
A: 

Something like this should probably work.

#include <cstdio>
#include <cstdlib>

// Synchronize, then report and abort on any error from the preceding launch.
#define CUT_CHECK_ERROR(errorMessage) do {                                 \
        cudaThreadSynchronize();                                           \
        cudaError_t err = cudaGetLastError();                              \
        if (cudaSuccess != err) {                                          \
            fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                            \
        } } while (0)


__device__ float *devPtr;

__global__
void kernel1(float *some_neat_data)
{
    // Every thread stores the same pointer value into the device-side global.
    devPtr = some_neat_data;
}

__global__
void kernel2(void)
{
    devPtr[threadIdx.x] *= .3f;
}


int main(int argc, char *argv[])
{
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    CUT_CHECK_ERROR("kernel1");

    kernel2<<<1,128>>>();

    CUT_CHECK_ERROR("kernel2");

    return 0;
}

Give it a whirl.

That's interesting. I see how cudaMalloc isn't being called directly on devPtr; rather, the pointer is being set in that first kernel call. I will give that a try later and let you know if it works out, thanks very much :D
Voltaire
A: 

Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.

Mark Borgerding
A: 

I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.

The good news is that it does work :)

However, I think it confuses the compiler, as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.

Anyway, for the record - I have looked at many of the examples, and I did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example that does dynamic global device memory allocation, I would still like to know.

Voltaire
A: 

Hi,

Can you share the code? I am trying the same, but having problems when working with large data.

thanks,

shahed

A: 

As this data is used in numerous functions, I would like it to be global.

-

There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.

What if the kernel operates on a large const structure consisting of arrays? Using so-called constant memory is not an option, because it's very limited in size... so then you have to put it in global memory?

Nils
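
One sketch of a pattern that fits that case: leave the large arrays in global memory and keep only a small struct of pointers in constant memory, copied in with cudaMemcpyToSymbol(). The names below are illustrative.

struct Params {
    float* a;   // large arrays stay in global memory
    float* b;
    int n;
};

__constant__ Params params; // only the small struct of pointers lives in constant memory

__global__
void use_params(void)
{
    if (threadIdx.x < params.n)
        params.b[threadIdx.x] = 2.0f * params.a[threadIdx.x];
}

int main()
{
    Params h;
    h.n = 1 << 20;
    cudaMalloc((void**)&h.a, h.n * sizeof(float));
    cudaMalloc((void**)&h.b, h.n * sizeof(float));
    // Copy the struct (pointer values plus the count) into constant memory.
    cudaMemcpyToSymbol(params, &h, sizeof(h));

    use_params<<<1,128>>>();
    return 0;
}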