cuda

Thread hierarchy design for a CUDA kernel

Assuming a block has a limit of 512 threads and my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance? (Case 1) 1st block: 512 threads; 2nd block: the remaining threads. (Case 2) Distribute an equal number of threads across several blocks. ...
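
As a rough illustration of the arithmetic behind the two cases, here is a minimal host-side C++ sketch. The helper names `blocks_needed` and `balanced_block_size` are hypothetical and do not come from the question. Every block in a single CUDA kernel launch has the same size, so case 1 is usually expressed as "enough full-size blocks to cover the work, plus a bounds check in the kernel", while case 2 shrinks the block size so the work divides more evenly.

```cpp
#include <cassert>

// Hypothetical helper: how many blocks are needed to cover total_threads
// when a block may hold at most max_per_block threads (ceiling division).
inline int blocks_needed(int total_threads, int max_per_block) {
    assert(total_threads > 0 && max_per_block > 0);
    return (total_threads + max_per_block - 1) / max_per_block;
}

// Hypothetical helper: the smallest uniform block size that still covers
// total_threads with the block count computed above (the "case 2" idea).
inline int balanced_block_size(int total_threads, int max_per_block) {
    int blocks = blocks_needed(total_threads, max_per_block);
    return (total_threads + blocks - 1) / blocks;
}
```

For 600 threads and a 512-thread limit, this gives 2 blocks either way: 2 x 512 with 424 idle threads masked off by a bounds check, or 2 x 300. In practice one would probably also round the block size up to a multiple of the 32-thread warp size; which layout performs better is best measured on the target hardware.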

Using C++ templates or macros for compile time function generation

Hi all, I have code that runs on an embedded system, and it has to run really fast. I know C and macros, and this particular project is coded mostly in C, but it also uses C++ templates (increasingly so). There is an inline function: inline my_t read_memory(uint32 addr) { #if (CURRENT_STATE & OPTIMIZE_BITMAP) return readOptimiz...

CUDA driver version is insufficient for CUDA runtime version

I got the message "cutilCheckMsg() CUTIL CUDA error : kernel launch failure : CUDA driver version is insufficient for CUDA runtime version." while trying to run an example program. It also happens with the function cutilSafeCall. My environment is: Windows 7 64-bit, Visual Studio 2008, CUDA developer driver, toolkit a...

Calling get_global_id() multiple times vs. saving the result in a local variable?

It is probably a silly question, but how expensive is it to call a get_* function in an OpenCL kernel? Is it better to save the result in a local variable for later use, or to call the function whenever it is needed? Or is it platform dependent? PS: I think CUDA solves this better with its various threadIdx variables. ...

C++ custom exceptions

Hello. I have run into a broken compiler that does not allow exceptions to inherit from std::exception (nvcc 3.0), so I had to create a workaround: struct exception { explicit exception(const char* message) { what_ = message; } virtual const char *what() const throw() { return what_; } operator std::exception() con...

CUDA Basic Matrix Addition - Large Matrices

Hi all, I'm trying to add two 4800x9600 matrices, but am running into difficulties... It's a simple C=A+B operation... Here is the kernel: __global__ void matAdd_kernel(float* result,float* A,float* B,int size) { int x=blockIdx.x*blockDim.x+threadIdx.x; int y=blockIdx.y*blockDim.y+threadIdx.y; int idx=x*y+x; ...
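
For reference, a common way to flatten a 2D coordinate into a linear offset is row-major indexing, `y * width + x`. The sketch below is host-side C++ with a hypothetical helper name (`flat_index`), and it assumes the matrices are stored row-major, which the truncated excerpt does not confirm. By contrast, an expression of the form `x*y+x` is not a one-to-one mapping: for example, (x=3, y=2) and (x=1, y=8) both yield 9.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major offset of element (x, y) in a matrix that
// is `width` elements wide. Assumes row-major storage.
inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);  // x must stay inside the row
    return y * width + x;
}
```

In a kernel, the same expression would typically be paired with a bounds check on both `x` and `y` so that threads past the matrix edge do nothing.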

Pointers in structs passed to CUDA

Hi folks, I've been messing around with this for a while now, but can't seem to get it right. I'm trying to copy objects that contain arrays into CUDA device memory (and back again, but I'll cross that bridge when I come to it): struct MyData { float *data; int dataLen; } void copyToGPU() { // Create dummy objects to copy int ...

Tutorial for CUDA + OpenGL.

I'm looking for a simple beginner's tutorial for CUDA with OpenGL, and for instructions on setting up the CUDA environment on Ubuntu. Thanks in advance. ...

Can I share CUDA GPU device memory between host processes?

Is it possible to have two or more Linux host processes that can access the same device memory? I have two processes streaming data between them at a high rate, and I don't want to copy the data from the GPU back to the host in process A just to pass it to process B, which will then memcpy it host-to-device back into the GPU. Combining the multiple processes in...

CUDA Add Rows of a Matrix

Hi, I'm trying to add the rows of a 4800x9600 matrix together, resulting in a matrix 1x9600. What I've done is split the 4800x9600 into 9,600 matrices of length 4800 each. I then perform a reduction on the 4800 elements. The trouble is, this is really slow... Anyone got any suggestions? Basically, I'm trying to implement MATLAB's su...
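
As a point of comparison, the operation described (adding all rows together to get one value per column) can be written as a single pass over the matrix. The following host-side C++ sketch is a reference implementation under stated assumptions: the function name `column_sums` is hypothetical and row-major storage is assumed. The inner loop walks adjacent columns, which mirrors a GPU layout with one thread per column, where neighbouring threads read neighbouring addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: sum a rows x cols row-major matrix down its rows,
// producing one total per column.
inline std::vector<float> column_sums(const std::vector<float>& m,
                                      std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<float> sums(cols, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sums[c] += m[r * cols + c];  // accumulate row r into each column total
        }
    }
    return sums;
}
```

Whether a one-thread-per-column kernel actually outperforms 9,600 separate reductions would need to be measured; this sketch only gives a result to check a GPU version against.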

CUDA - Maintain pointers to global memory

I have a .NET program that uses CUDA. The CUDA code is accessed through a C DLL. What I am doing is initializing my CUDA application by allocating buffers (cudaMalloc) on the device at program startup. Pointers to these buffers are then maintained in static variables declared in the DLL. Data is copied to and from the buffers thro...

Parallel search in CUDA

Are there any CUDA methods, approaches, or libraries for a search operation, say finding an integer in an array of a million entries? I am looking for more of a parallel search approach. ...

Can I call a "function-like macro" in a header file from a CUDA __global__ function?

This is part of my header file ("aes_locl.h"): . . # define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00) # define GETU32(p) SWAP(*((u32 *)(p))) # define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); } . . Now, in a .cu file, I have declared a __global__ function and included the header file like this: #include "...

CUDA programming

Hi all, I am new to CUDA. I have a question about a simple program, and I hope someone can spot my mistake. __global__ void ADD(float* A, float* B, float* C) { const int ix = blockDim.x * blockIdx.x + threadIdx.x; const int iy = blockDim.y * blockIdx.y + threadIdx.y; if(ix < 16 && iy < 16) { for(int i = 0; i<256; i++) C...

CUDA float to long long conversion and texture read giving incorrect result.

I would ask this in the CUDA forums, but for some reason I can't get past the first page of the registration, so here goes: nVidia Card: 9800 GT CUDA toolkit 3.0 Compiled for: compute capability 1.1 Scenario 1: float result = 0; float f1 = tex2D( tex, u, v ); float f2 = tex2D( tex, u + 1; v + 1 ); long long ll1 = __float2ll_rn...

CUThread LNK2001 error

1>Linking... 1>main.cu.obj : error LNK2001: unresolved external symbol cutWaitForThreads 1>main.cu.obj : error LNK2001: unresolved external symbol cutStartThread I get these errors when trying to build my project. I have included cutil64 in the linker dependencies, but apparently that is not enough. I can't seem to figure out what's wrong w...

Total/texture accessible memory by DirectX/CUDA/OpenGL

Hi, can someone please explain the difference between texture memory as used in the context of CUDA and texture memory as used in the context of DirectX? Suppose a graphics card has 512 MB of advertised memory: how is it divided into constant memory, texture memory, and global memory? E.g. I have a Tesla card that has totalConstMem as ...

Efficient reduction of 2D array in CUDA?

In the CUDA SDK, there is example code and presentation slides for an efficient one-dimensional reduction. I have also seen several papers on and implementations of one-dimensional reductions and prefix scans in CUDA. Is there efficient CUDA code available for a reduction of a dense two-dimensional array? Pointers to code or pertinent...

Comparing MATLAB vs. CUDA correlation and reduction on a 2D array

I am trying to compare cross-correlation using an FFT vs. using a windowing method. My MATLAB code is: isize = 20; n = 7; for i = 1:n %%7x7 xcorr for j = 1:n xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1))); %%ref is 676 element array and ffcorr1 is a 400 element array end end A similar CUDA kernel: __global__ void xc_...

Compiling specific code NULLs my textures.

A very strange error: if I add some specific code to my project, any textures I use contain nothing but 0, even when I'm not running any of the code that was added. The specific code here is the kernels of an nVidia CUDA sample [1], the Bicubic Texture Filtering sample, specifically the CatMulRom kernel. I've traced it down to one of the...