cuda

Multiply-reduce and find maximum & index in parallel

I have two arrays, ffcorr_d and ref_d, each holding 19600 values. The first kernel, simple_multiply, performs an element-wise multiply together with a sum reduction. I launch this kernel with 49 blocks of 400 threads: simple_multiply <<< nblocksn, blocksize >>> (ffcorr_d, ref_d, out1_d, out2_d, d_indices); const int threads = 400; __global__ void ...
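
For reference, a minimal sketch of the multiply-plus-sum-reduction stage (the kernel and array names come from the question; the shared-memory layout, the guard for the non-power-of-two block size of 400, and the per-block output are my assumptions):

    __global__ void simple_multiply(const float *ffcorr_d, const float *ref_d,
                                    float *block_sums, int n)
    {
        __shared__ float cache[400];              // one slot per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        cache[tid] = (i < n) ? ffcorr_d[i] * ref_d[i] : 0.0f;
        __syncthreads();

        // Tree reduction. 256 is the largest power of two below the
        // 400-thread block size; the extra bound check handles the rest.
        for (int s = 256; s > 0; s >>= 1) {
            if (tid < s && tid + s < blockDim.x)
                cache[tid] += cache[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = cache[0];    // one partial sum per block
    }

The max-and-index part can reuse the same loop with a second shared array of indices, replacing the addition with a compare-and-keep of the larger value and its index.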

How to copy a CUDA-generated PBO to a texture with mipmapping

I'm trying to copy a PBO into a texture with automatic mipmapping enabled, but it seems only the top-level texture is generated (in other words, no mipmapping is occurring). I'm building the PBO using //Generate a buffer ID called a PBO (Pixel Buffer Object) glGenBuffers(1, pbo); //Make this the current UNPACK buffer glBindBuffer(GL_PIXEL_UNPA...
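
One pattern that usually works (a sketch, not the asker's code; tex, pbo, width and height are placeholder names) is to upload from the bound PBO with glTexSubImage2D and then rebuild the mip chain explicitly with glGenerateMipmap, rather than relying on automatic mipmap generation:

    // Upload the CUDA-written PBO contents into the texture
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    // With a PBO bound, the data argument is an offset into the buffer
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0);
    // Regenerate all mip levels from level 0 (GL 3.0 / FBO extension)
    glGenerateMipmap(GL_TEXTURE_2D);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);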

CUDA constant memory invalid symbol

struct d_struct { // stuff }; __device__ __constant__ d_struct structs[SIZE]; When I call cudaMemcpyToSymbol("structs", &h_struct, sizeof(d_struct), index * sizeof(d_struct), cudaMemcpyHostToDevice) on a d_struct "h_struct" in host memory, I get an "invalid device symbol" CUDA error. ...
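
A sketch of the form that typically avoids this error, assuming the constant array and the copy are compiled in the same translation unit by nvcc; passing the symbol itself rather than the string "structs" also sidesteps the string-lookup path, which newer toolkits have removed:

    struct d_struct { int a; float b; };            // placeholder members
    #define SIZE 16                                 // SIZE assumed, not given
    __device__ __constant__ d_struct structs[SIZE];

    // Host-side: copy one struct into slot 'index' of the constant array
    void upload(const d_struct &h_struct, int index) {
        cudaMemcpyToSymbol(structs, &h_struct, sizeof(d_struct),
                           index * sizeof(d_struct), cudaMemcpyHostToDevice);
    }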

CUDA kernel parameters

Say I have a CUDA kernel __global__ void foo (int a, int b) { ... ... }. Where are a and b stored? Does this take register space for each thread? ...

Streaming multiprocessor count

How do I know how many streaming multiprocessors (SMs) I have on my GTS 250? ...
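
A small host-side sketch that reads the count at runtime from the multiProcessorCount field of cudaDeviceProp (device 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // query device 0
        printf("%s: %d streaming multiprocessors\n",
               prop.name, prop.multiProcessorCount);
        return 0;
    }

The deviceQuery sample shipped with the SDK prints the same field.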

CUDA unused threads

Say I have 64 threads in a kernel __global__ void kernel( ... ) { int i = threadIdx.x; ... ... if (i < 32) { ... ... } } Basically, after a certain point I won't use threads 32 to 63 any more. What will they do then? Will they still consume processor time, or are they just dead? ...

How to create a CUDA DLL?

Hi all! I need to use CUDA in my application, but I can't create a DLL. Here is some of the code: __global__ void calc(float *a, int n) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < n) { float val = a[idx]; a[idx] = 4.0f / (1.0f + val*val); } } ... extern "C" __declspec(dllexport) void GPU_Code ( f...
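
For comparison, a minimal sketch of the whole DLL side, assuming the kernel and the exported wrapper live in one .cu file compiled by nvcc into a DLL; GPU_Code is the question's export name, and the wrapper body (allocation, copies, launch geometry) is my assumption:

    #include <cuda_runtime.h>

    __global__ void calc(float *a, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float val = a[idx];
            a[idx] = 4.0f / (1.0f + val * val);
        }
    }

    // extern "C" prevents C++ name mangling; __declspec(dllexport) exports
    // the wrapper so a plain C/C++ host program can call into the DLL
    extern "C" __declspec(dllexport) void GPU_Code(float *host_a, int n) {
        float *dev_a;
        cudaMalloc(&dev_a, n * sizeof(float));
        cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
        calc<<<(n + 255) / 256, 256>>>(dev_a, n);
        cudaMemcpy(host_a, dev_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_a);
    }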

Size of statically allocated shared memory per block in Compute Prof (CUDA/OpenCL)

In NVIDIA's Compute Profiler there is a column called "static private mem per work group", whose tooltip says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes, I assume) per block. Does that mean I am using somewhere between 1 and 64 of those bytes, or is the profiler just telling me t...
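
For context, "statically allocated" refers to shared memory whose size is fixed at compile time, and the profiler reports what the kernel reserves, not how much of it is touched. A declaration like the following (my example, not the asker's code) accounts for exactly 64 bytes per block:

    __global__ void k(float *out) {
        __shared__ float buf[16];   // 16 * sizeof(float) = 64 bytes, reserved
                                    // per block whether or not all are used
        int tid = threadIdx.x;
        if (tid < 16) buf[tid] = (float)tid;
        __syncthreads();
        if (tid == 0) out[blockIdx.x] = buf[0];
    }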

What size are integers when programming CUDA kernels?

I can't seem to find an answer to this simple question in the CUDA Programming Guide: when compiling a kernel with nvcc, what size integer is declared by short, int, long, and long long? Does it depend on my host architecture, so that I should use int16_t, int32_t, and int64_t, or is it always a fixed size? ...
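
nvcc follows the host ABI for device code, so short, int, and long long are 16, 32, and 64 bits on all current platforms, while long follows the host (32-bit on 64-bit Windows, 64-bit on 64-bit Linux). A compile-time check sketch, assuming a C++11-capable nvcc:

    #include <cstdint>

    __global__ void check_sizes(int32_t *out) {
        static_assert(sizeof(short)     == 2, "short is 16-bit");
        static_assert(sizeof(int)       == 4, "int is 32-bit");
        static_assert(sizeof(long long) == 8, "long long is 64-bit");
        // sizeof(long) matches the host compiler's data model, which is why
        // the fixed-width types are the portable choice across platforms
        out[0] = (int32_t)sizeof(long);
    }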

Function inside a CUDA kernel

Is there any way I can have a function inside a CUDA kernel? My kernel gets pretty long and hard to debug at some point. Thanks. ...
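
Yes; functions marked __device__ can be called from a kernel, which is the usual way to break a long kernel up. A short sketch:

    // __device__ functions are callable only from device code; the compiler
    // usually inlines them, so there is normally no call overhead
    __device__ float rational(float x) {
        return 4.0f / (1.0f + x * x);
    }

    __global__ void kernel(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = rational(a[i]);    // ordinary function call in the kernel
    }

Marking a helper __host__ __device__ lets the same function compile for both sides.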

CUDA counting, reduction and thread warps

I'm trying to create a CUDA program that counts the number of true values (defined as non-zero values) in a long vector through a reduction algorithm. I'm getting funny results: I get either 0 or ceil(N/threadsPerBlock)*threadsPerBlock, neither of which is correct. __global__ void count_reduce_logical(int * l, int * cntl, int N){ // sum...
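
A sketch of a per-block count reduction that avoids two common causes of exactly these symptoms: reducing the raw values instead of 0/1 flags, and calling __syncthreads() inside a divergent branch. The kernel signature is the question's; the body and the 256-threads-per-block assumption are mine:

    __global__ void count_reduce_logical(int *l, int *cntl, int N) {
        __shared__ int cache[256];                  // one flag per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        cache[tid] = (i < N && l[i] != 0) ? 1 : 0;  // flag, not the raw value
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                cache[tid] += cache[tid + s];
            __syncthreads();         // outside the if: every thread reaches it
        }
        if (tid == 0)
            cntl[blockIdx.x] = cache[0];            // partial count per block
    }

The host (or a second kernel) then sums the per-block partial counts in cntl.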

CUDA threading allocation

Hi, I have gone through the CUDA Programming Guide, but I can't understand the following thread-allocation example: dim3 dimGrid(2, 2, 1); dim3 dimBlock(4, 2, 2); KernelFunction<<<dimGrid, dimBlock>>>(. . .); Can someone explain how the threads are allocated in this configuration? ...
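
Concretely, dimGrid(2, 2, 1) launches 2*2 = 4 blocks and dimBlock(4, 2, 2) puts 4*2*2 = 16 threads in each block, 64 threads in total. A sketch showing how each of those threads can flatten its built-in indices into a unique global index (the kernel body is my illustration):

    __global__ void KernelFunction(int *out) {
        // Flatten the 3-D thread index inside the block (4 x 2 x 2 = 16)
        int tib = threadIdx.x
                + threadIdx.y * blockDim.x
                + threadIdx.z * blockDim.x * blockDim.y;
        // Flatten the 2-D block index inside the grid (2 x 2 = 4)
        int bid = blockIdx.x + blockIdx.y * gridDim.x;
        // 4 blocks * 16 threads = 64 unique indices, 0..63
        int gid = bid * blockDim.x * blockDim.y * blockDim.z + tib;
        out[gid] = gid;
    }

    // launch: dim3 dimGrid(2, 2, 1); dim3 dimBlock(4, 2, 2);
    //         KernelFunction<<<dimGrid, dimBlock>>>(d_out);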

How to compile a CUDA app in Visual Studio 2010?

How to compile a CUDA app in Visual Studio 2010? Here are my steps: 1. Create an empty C++ project without precompiled headers. 2. Add main.cpp: int main() { return 0; } 3. Add kernels.cu. I referred to the sample project MatrixMul and copied its settings step by step. It can be compiled now: #include "cuda.h" __global__ void VecAdd(float*...

CUDA doesn't calculate what it is expected to, just silently ignores my code

Hi everyone. I'm encountering a very strange problem: my 9800GT doesn't seem to calculate at all. I've tried all the hello-world examples I've found on the internet; here's one of them. This program creates a 1..100 array on the host, sends it to the device, calculates the square of each value, returns it to the host, and prints the results. #include "stdafx.h" #inc...
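
For comparison, a self-contained version of that hello-world with error checking added (my reconstruction; on a mismatched driver/toolkit every CUDA call fails silently unless the return codes are inspected, which would explain untouched output):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void square(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = a[i] * a[i];
    }

    int main() {
        const int N = 100;
        float h[N];
        for (int i = 0; i < N; ++i) h[i] = (float)(i + 1);   // 1..100

        float *d;
        cudaMalloc(&d, N * sizeof(float));
        cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
        square<<<(N + 127) / 128, 128>>>(d, N);
        cudaError_t err = cudaGetLastError();          // catch launch failure
        if (err != cudaSuccess)
            printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);

        for (int i = 0; i < N; ++i) printf("%g ", h[i]);
        printf("\n");
        return 0;
    }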

CUDA + VS2010 without VS2008

Hello, I'd like to know whether it's possible to program for CUDA without installing VS2008. At the moment I've got VS2010 installed on my primary development machine, and I don't want to mess things up by installing VS2008. Furthermore, I would have no use for it aside from CUDA. I've been doing a few searches and it looks like it should be possib...

Coalesced reads of short integers in CUDA

Say I want to load an array of shorts from global memory into shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half-warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed. If I understand ...
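
One common workaround on those devices is to have each thread read a naturally aligned short2 (4 bytes), so a half-warp touches one contiguous, aligned 64-byte segment and the load coalesces; a sketch assuming an even element count and 256 threads per block:

    __global__ void load_shorts(const short2 *in, int n2) {
        // n2 = number of short2 elements, i.e. half the number of shorts
        __shared__ short s[512];                   // 2 shorts per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        if (i < n2) {
            short2 v = in[i];                      // one 4-byte coalesced load
            s[2 * tid]     = v.x;                  // unpack into shared memory
            s[2 * tid + 1] = v.y;
        }
        __syncthreads();
        // ... operate on s[] here ...
    }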

How to mitigate host + device memory transfer bottlenecks in OpenCL/CUDA

If my algorithm is bottlenecked by host-to-device and device-to-host memory transfers, is the only solution a different or revised algorithm? ...
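
Not necessarily; the standard mitigations are pinned host memory and asynchronous copies issued on multiple streams, so transfers overlap with kernel execution. A CUDA sketch of the pattern (the kernel, the four-way chunking, and the assumption that n divides evenly are all illustrative):

    #include <cuda_runtime.h>

    // Hypothetical kernel operating on one chunk
    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void run_overlapped(float *h, float *d, int n) {
        const int CHUNKS = 4;
        int chunk = n / CHUNKS;                    // assume n divides evenly
        cudaStream_t s[CHUNKS];
        for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);

        for (int c = 0; c < CHUNKS; ++c) {
            int off = c * chunk;
            // h must be pinned (cudaHostAlloc/cudaHostRegister) for the
            // async copies to actually overlap with the kernels
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[c]);
            process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[c]);
        }
        for (int c = 0; c < CHUNKS; ++c) {
            cudaStreamSynchronize(s[c]);
            cudaStreamDestroy(s[c]);
        }
    }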

Creating a linked list using CUDA

Is it possible to create a linked list on a GPU using CUDA? I am trying to do this and I am running into some difficulties. If I can't allocate dynamic memory in a CUDA kernel, how can I create a new node and add it to the linked list? ...
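
Before compute capability 2.0 there is no malloc in device code, so the usual trick is to pre-allocate a pool of nodes with cudaMalloc and hand slots out with an atomic counter, using indices instead of pointers; a sketch of a lock-free push onto the head of a list:

    struct Node {
        int value;
        int next;        // index of the next node in the pool, -1 = list end
    };

    // pool:    preallocated node array (cudaMalloc'd, large enough)
    // counter: single int initialized to 0; head: single int initialized to -1
    __global__ void push(Node *pool, int *counter, int *head, int value) {
        int idx = atomicAdd(counter, 1);        // "allocate" one pool slot
        pool[idx].value = value;
        // splice in at the head: next <- old head, head <- this node
        pool[idx].next = atomicExch(head, idx);
    }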

CUDA kernels produce different results on two different GPUs (GeForce 8600M GT vs Quadro FX 770M)

Hi everybody, I've been working on an AES CUDA application, and I have a kernel which performs ECB encryption on the GPU. To ensure the logic of the algorithm is not modified when running in parallel, I send a known input test vector provided by NIST and then, from host code, compare the output with the known test vector output prov...

Pinned memory in OpenCL: has anybody successfully used it?

I used the CL_MEM_ALLOC_HOST_PTR flag with my clCreateBuffer calls, but the Compute Profiler shows all of my "host mem transfer type" entries as Pageable. I tried it in two different kernel setups, but the profiler wouldn't show that I was using pinned memory. Is it just really random when a kernel gets to use pinned memory? Is it constraine...
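
In NVIDIA's OpenCL stack, CL_MEM_ALLOC_HOST_PTR by itself is usually not enough to see pinned transfers in the profiler: the pattern in NVIDIA's oclBandwidthTest sample maps the CL_MEM_ALLOC_HOST_PTR buffer and uses the mapped pointer as the host side of subsequent reads/writes. A host-side sketch (ctx, queue, device_buf, src, and bytes are assumed to be set up by the caller):

    #include <CL/cl.h>
    #include <string.h>

    // Sketch: stage a host-to-device transfer through a pinned buffer
    void pinned_write(cl_context ctx, cl_command_queue queue,
                      cl_mem device_buf, const void *src, size_t bytes) {
        cl_int err;
        // Host-side buffer the driver may back with pinned memory
        cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                       bytes, NULL, &err);
        // Map it; the mapped pointer is the pinned staging area
        void *host_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE,
                                            CL_MAP_WRITE, 0, bytes,
                                            0, NULL, NULL, &err);
        memcpy(host_ptr, src, bytes);
        // Copy from the pinned staging area into the device buffer
        err = clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, bytes,
                                   host_ptr, 0, NULL, NULL);
        clFinish(queue);
        clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);
        clReleaseMemObject(pinned);
    }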