I am developing a program using the CUDA SDK and a 9600 1 GB NVIDIA card. In this program:

0) The kernel is passed a pointer to a 2D int array of size 3000x6 in its input arguments.

1) The kernel has to sort it on up to 3 keys (1st, 2nd & 3rd columns).

2) For this purpose, the kernel declares an array of int pointers of size 3000.

3) The kernel then populates the pointer array with pointers to the rows of the input array in sorted order.

4) Finally, the kernel copies the input array into an output array by dereferencing the pointer array.

This last step fails and it halts the PC.
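For reference, a minimal sketch of what the kernel does (identifiers are illustrative, not my actual code):

__global__ void sortRows(int *in, int *out)   // in/out: 3000x6 ints, both in device memory
{
    int *rows[3000];                          // step 2: array of int pointers
    for (int i = 0; i < 3000; ++i)
        rows[i] = &in[i * 6];                 // step 3: point at the rows (then reorder
                                              // these pointers into sorted order)
    for (int i = 0; i < 3000; ++i)            // step 4: copy out via the pointers
        for (int j = 0; j < 6; ++j)
            out[i * 6 + j] = rows[i][j];      // this dereference is where it fails
}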

Q1) What are the guidelines for dereferencing pointers in CUDA to fetch the contents of memory?

Even the smallest array, 20x2, does not work correctly; the same code works outside CUDA device memory (i.e. as a standard C program).

Q2) Isn't it supposed to work the same as in standard C using the '*' operator, or is there some CUDA API to be used for it?

+1  A: 

I just started looking into CUDA, but I literally just read this in a book. It sounds like it directly applies to you.

"You can pass pointers allocated with cudaMalloc() to functions that execute on the device.(kernals, right?)

You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device .(kernals again)

You can pass pointers allocated with cudaMalloc to functions that execute on the host. (regular C code)

You CANNOT use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host."

  • ^^ from "CUDA by Example" by Jason Sanders and Edward Kandrot, published by Addison-Wesley, yadda yadda, no plagiarism here.

Since you are dereferencing inside the kernel, maybe the opposite of the last rule is also true, i.e. you cannot use pointers allocated by the host to read or write memory from code that executes on the device.
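So, for example (my own hypothetical snippet, not from the book), I would expect something like this to be exactly the failing case:

int *h_arr = (int*)malloc(3000 * 6 * sizeof(int)); // plain host memory
myKernel<<<1, 1>>>(h_arr);  // compiles fine, but h_arr is a host address...
                            // ...so dereferencing it inside the kernel crashes/hangs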

Edit: I also just noticed a function called cudaMemcpy().

Looks like you would need to declare the 3000-int array twice in host code: once by calling malloc(), once by calling cudaMalloc(). Pass the CUDA one to the kernel, as well as the input array to be sorted. Then, after calling the kernel function:

cudaMemcpy(malloced_array, cudaMallocedArray, 3000*sizeof(int), cudaMemcpyDeviceToHost);

Like I said, though, I literally just started looking into this, so maybe there's a better solution.
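Putting that together, the whole host-side flow would be something like this (just a sketch of what I mean; sortKernel, grid and block are placeholders):

int *h_data = (int*)malloc(3000 * 6 * sizeof(int));  // host copy, filled with your data
int *d_in, *d_out;                                   // device copies
cudaMalloc((void**)&d_in,  3000 * 6 * sizeof(int));
cudaMalloc((void**)&d_out, 3000 * 6 * sizeof(int));
cudaMemcpy(d_in, h_data, 3000 * 6 * sizeof(int), cudaMemcpyHostToDevice);
sortKernel<<<grid, block>>>(d_in, d_out);            // kernel only ever sees device pointers
cudaMemcpy(h_data, d_out, 3000 * 6 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);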

Tom
Hi Tom, you should check out Thrust (http://code.google.com/p/thrust/): if it is applicable to your project it can be a great timesaver (both now and for maintenance).
Tom
A: 

CUDA code can use pointers in exactly the same manner as host code (e.g. dereference with * or [], normal pointer arithmetic and so on). However, it is important to remember that the location being accessed (i.e. the location to which the pointer points) must be visible to the GPU.

If you allocate host memory, using malloc() or std::vector for example, then that memory will not be visible to the GPU: it is host memory, not device memory. To allocate device memory you should use cudaMalloc(); pointers to memory allocated with cudaMalloc() can be freely accessed from the device, but not from the host.

To copy data between the two, use cudaMemcpy().

When you get more advanced, the lines can be blurred a little: using "mapped memory" it is possible to allow the GPU to access parts of host memory, but this must be handled in a particular way; see the CUDA Programming Guide for more information.
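For illustration only (a sketch; mapped memory needs device support, and the flag must be set before any other CUDA calls create the context; N, myKernel, grid and block are placeholders):

cudaSetDeviceFlags(cudaDeviceMapHost);                 // enable mapping before context creation
int *h_buf, *d_alias;
cudaHostAlloc((void**)&h_buf, N * sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);  // device-side alias of the host buffer
myKernel<<<grid, block>>>(d_alias);                    // GPU reads/writes host memory directly
cudaThreadSynchronize();                               // make the writes visible to the host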

I'd strongly suggest you look at the CUDA SDK samples to see how all this works. Start with the vectorAdd sample perhaps, and any that are specific to your domain of expertise. Matrix multiplication and transpose are probably easy to digest too.

All the documentation, the toolkit and the code samples (SDK) are available on the CUDA developer web site.

Tom