Spend some time focusing on the copious documentation offered by NVIDIA.
From the Programming Guide:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:
__global__
void kernel1(float *some_neat_data)
{
some_neat_data[threadIdx.x]++;
}
__global__
void kernel2(float *potentially_that_same_neat_data)
{
potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}
So now you can invoke them like so:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
As this data is used in numerous
functions, I would like it to be
global.
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
EDIT:
Ok, the fundamental problem is this: your kernels can only access device memory and the only global-scope pointers that they can use are GPU ones. When calling a kernel from your CPU, behind the scenes what happens is that the pointers and primitives get copied into GPU registers and/or shared memory before the kernel gets executed.
So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.
#include <algorithm>
__constant__ float devPtr[1024];
__global__
void kernel1(float *some_neat_data)
{
some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}
__global__
void kernel2(float *potentially_that_same_neat_data)
{
potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}
int main(int argc, char *argv[])
{
float some_data[256];
for (int i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
{
some_data[i] = i * 2;
}
cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr) ));
float* otherDevPtr;
cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));
kernel1<<<1,128>>>(otherDevPtr);
kernel2<<<1,128>>>(otherDevPtr);
return 0;
}
Don't forget '--host-compilation=c++' for this example.