Assuming a block has a limit of 512 threads and my kernel needs more than 512 threads to execute, how should I design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) Distribute an equal number of threads across a number of blocks.
...
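One common way to handle the more-than-512-threads case is to fix the block size and round the block count up, then guard the tail inside the kernel. A minimal host-side sketch in C++, assuming a one-dimensional launch; `blocksNeeded` and `activeThreadsInLastBlock` are hypothetical helper names, not part of the CUDA API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks needed to cover `totalThreads`
// work items when each block holds `threadsPerBlock` threads.
// This is plain ceiling division.
std::size_t blocksNeeded(std::size_t totalThreads, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: how many threads of the last block have real work.
// The remaining threads of that block must be masked out in the kernel
// with a bounds check such as `if (i < totalThreads)`.
std::size_t activeThreadsInLastBlock(std::size_t totalThreads, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    if (totalThreads == 0) return 0;
    const std::size_t rem = totalThreads % threadsPerBlock;
    return rem == 0 ? threadsPerBlock : rem;
}
```

For 1000 work items and 512 threads per block this gives 2 blocks, with 488 active threads in the second one. Whether several equal, smaller blocks perform better depends on the kernel's register and shared-memory use, so it is worth measuring both layouts.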
Hi all,
I have code that runs on an embedded system, and it has to run really fast. I know C and macros, and this particular project is coded mostly in C, but it also uses C++ templates [increasingly so]. There is an inline function:
inline my_t read_memory(uint32 addr) {
#if (CURRENT_STATE & OPTIMIZE_BITMAP)
return readOptimiz...
I got the message
"cutilCheckMsg() CUTIL CUDA error : kernel launch failure : CUDA driver version is insufficient for CUDA runtime version."
while trying to run some example source code.
It also happens with the function cutilSafeCall.
I am working in the following environment:
Windows 7 64-bit
Visual Studio 2008
CUDA developer driver, toolkit a...
It is probably a silly question, but:
How expensive is it to call a get_* function in an OpenCL kernel? Is it better to save the result in a local variable for later use, or to call the function whenever it is needed?
Or is it platform dependent?
PS
I think CUDA solves this better with its various threadIdx variables.
...
Hello.
I have run into a broken compiler (nvcc 3.0) that does not allow exceptions to inherit from std::exception.
So I had to create a workaround:
struct exception {
explicit exception(const char* message) {
what_ = message;
}
virtual const char *what() const throw() { return what_; }
operator std::exception() con...
Hi all,
I'm trying to add two 4800x9600 matrices, but am running into difficulties...
It's a simple C=A+B operation...
Here is the kernel:
__global__ void matAdd_kernel(float* result,float* A,float* B,int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = x * y + x;
...
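For reference, the usual row-major mapping from a 2D coordinate to a linear offset is `y * width + x`. The expression `x * y + x` does not give each element its own offset (every element with `x == 0` maps to 0). A small CPU sketch in C++, assuming the matrices are stored row-major; `linearIndex`, `matAddReference`, `width`, and `height` are illustrative names that do not come from the post:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major offset of element (x, y) in a matrix that is `width` columns wide.
std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t width) {
    return y * width + x;
}

// CPU reference for C = A + B, useful for checking a GPU result.
std::vector<float> matAddReference(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   std::size_t width, std::size_t height) {
    assert(A.size() == width * height);
    assert(B.size() == width * height);
    std::vector<float> C(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = linearIndex(x, y, width);
            C[i] = A[i] + B[i];
        }
    }
    return C;
}
```

In a kernel, the same index would normally be guarded with a bounds check on `x` and `y`, because the grid rarely divides the matrix dimensions exactly.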
Hi folks,
I've been messing around with this for a while now, but can't seem to get it right. I'm trying to copy objects that contain arrays into CUDA device memory (and back again, but I'll cross that bridge when I come to it):
struct MyData {
    float *data;
    int dataLen;
};
void copyToGPU() {
// Create dummy objects to copy
int ...
I'm looking for a simple beginner's tutorial for CUDA with OpenGL, and for how to set up the CUDA environment on Ubuntu.
Thanks in advance.
...
Is it possible to have two or more Linux host processes access the same device memory?
I have two processes streaming data between them at a high rate, and I don't want to bring the data out of the GPU to the host in process A just to pass it to process B, which will then memcpy it host-to-device back into the GPU.
Combining the multiple processes in...
Hi,
I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.
What I've done is split the 4800x9600 matrix into 9,600 vectors of length 4,800 each. I then perform a reduction on the 4,800 elements of each one.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's su...
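As a CPU reference for what this reduction should produce: adding the rows of a matrix together yields one total per column. A short sketch in C++, assuming row-major storage; `columnSums` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds all rows of a rows x cols row-major matrix together,
// returning a vector with one sum per column.
std::vector<float> columnSums(const std::vector<float>& m,
                              std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<float> out(cols, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        // The inner loop walks consecutive memory addresses.
        for (std::size_t c = 0; c < cols; ++c) {
            out[c] += m[r * cols + c];
        }
    }
    return out;
}
```

One design worth trying on the GPU, under the same row-major assumption, is to let each thread own one column and loop over the rows, so that neighbouring threads read neighbouring addresses; whether that beats a per-column reduction on a given card would need to be measured.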
I have a .NET program that is utilizing CUDA.
CUDA is accessed through a C DLL.
What I am doing is initializing my CUDA application by allocating buffers (cudaMalloc) on the device at program startup. Pointers to these buffers are then maintained in static variables declared in the DLL. Data is copied to and from the buffers thro...
Are there any CUDA methods, approaches, or libraries for search operations, say, finding an integer in an array of a million entries? I'm looking for more of a parallel search approach.
...
This is part of my header file ("aes_locl.h"):
.
.
# define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define GETU32(p) SWAP(*((u32 *)(p)))
# define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); }
.
.
Now, in a .cu file, I have declared a __global__ function and included the header file like this:
#include "...
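The `SWAP` macro in the header byte-swaps a 32-bit word using the rotate intrinsics `_lrotl` and `_lrotr`. Below is a sketch of the same operation written with plain shifts, so it does not depend on those intrinsics. `rotl32` and `swap32` are hypothetical names, and whether the intrinsics are what causes the problem in the truncated post is an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by n bits; valid for n in 1..31.
constexpr std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

// Reverse the byte order of a 32-bit value.
// A rotate right by 8 is expressed as a rotate left by 24.
constexpr std::uint32_t swap32(std::uint32_t x) {
    return (rotl32(x, 8) & 0x00ff00ffu) | (rotl32(x, 24) & 0xff00ff00u);
}
```

For example, `swap32(0x11223344u)` evaluates to `0x44332211u`.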
Hi all, I am new to CUDA. I have a question about a simple program, and I hope someone can spot my mistake.
__global__ void ADD(float* A, float* B, float* C)
{
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
if(ix < 16 && iy < 16)
{
for(int i = 0; i<256; i++)
C...
I would ask this in the CUDA forums, but for some reason I can't get past the first page of the registration, so here goes:
nVidia Card: 9800 GT
CUDA toolkit 3.0
Compiled for: compute capability 1.1
Scenario 1:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
long long ll1 = __float2ll_rn...
1>Linking...
1>main.cu.obj : error LNK2001: unresolved external symbol cutWaitForThreads
1>main.cu.obj : error LNK2001: unresolved external symbol cutStartThread
I get those errors when trying to build my project. I have added cutil64 to the linker dependencies, but I can see that's not it. I can't seem to figure out what's wrong w...
Hi,
Can someone please explain the difference between texture memory as used in the context of CUDA and texture memory as used in the context of DirectX? Suppose a graphics card has 512 MB of advertised memory: how is it divided into constant memory, texture memory, and global memory?
E.g. I have a tesla card that has totalConstMem as ...
In the CUDA SDK, there is example code and presentation slides for an efficient one-dimensional reduction. I have also seen several papers on and implementations of one-dimensional reductions and prefix scans in CUDA.
Is there efficient CUDA code available for a reduction of a dense two-dimensional array? Pointers to code or pertinent...
I am trying to compare cross-correlation using FFT vs using windowing method.
My MATLAB code is:
isize = 20;
n = 7;
for i = 1:n %%7x7 xcorr
for j = 1:n
xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1))); %%ref is 676 element array and ffcorr1 is a 400 element array
end
end
A similar CUDA kernel:
__global__ void xc_...
A very strange error: if I add some specific code to my project, any textures I use contain nothing but 0. Even when I'm not running any of the code that was added.
The specific code here is the kernels of an nVidia CUDA sample [1], the Bicubic Texture Filtering sample, specifically the CatMulRom kernel. I've traced it down to one of the...