I have two arrays, ffcorr_d and ref_d, each holding 19600 values. The first kernel, simple_multiply, does a multiply operation along with a sum reduction.
I launch this kernel with 49 blocks of 400 threads each.
simplemultiply <<< nblocksn, blocksize >>> (ffcorr_d, ref_d, out1_d, out2_d, d_indices);
const int threads = 400;
__global__ void ...
I'm trying to copy a PBO into a texture with automatic mipmapping enabled, but it seems only the top-level texture is generated (in other words, no mipmapping is occurring).
I'm building a PBO using
//Generate a buffer ID called a PBO (Pixel Buffer Object)
glGenBuffers(1, pbo);
//Make this the current UNPACK buffer
struct d_struct {
// stuff
__device__ __constant__ d_struct structs[SIZE];
When I call cudaMemcpyToSymbol("structs", &h_struct, sizeof(d_struct), index * sizeof(d_struct), cudaMemcpyHostToDevice) on a d_struct "h_struct" in host memory, I get an "invalid device symbol" CUDA error.
Say I have a CUDA kernel
__global__ void foo(int a, int b)
... ...
Where are a and b stored? Do they take up register space in each thread?
How do I find out how many streaming multiprocessors (SMs) I have on my GTS 250?
Say I have 64 threads in a kernel
__global__ void kernel( ... )
int i = threadIdx.x;
... ...
if (i < 32)
... ...
Basically, after a certain point I no longer use threads 32 to 63. What do they do then? Do they still consume processor power, or are they simply dead?
Hi all! I need to use CUDA in my application, but I can't create a DLL. Here is some code.
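For the 64-thread question above, it may help to see where the `i < 32` test falls relative to warp boundaries. The sketch below is plain host-side C++ (no GPU needed) and assumes a warp size of 32 and a one-dimensional block; `warpOf` and `branchSplitsAWarp` are hypothetical helper names, not CUDA API calls.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

// Warp that a thread of a 1-D block belongs to.
int warpOf(int threadIdxX) { return threadIdxX / kWarpSize; }

// True if some warp of a 1-D block contains both threads that take
// `if (i < cutoff)` and threads that skip it. Because the condition is
// monotonic in i, comparing the first and last thread of each warp is enough.
bool branchSplitsAWarp(int blockSize, int cutoff) {
    assert(blockSize > 0);
    for (int first = 0; first < blockSize; first += kWarpSize) {
        int last = first + kWarpSize - 1;
        if (last >= blockSize) last = blockSize - 1;
        const bool firstTakes = first < cutoff;
        const bool lastTakes = last < cutoff;
        if (firstTakes != lastTakes) return true;
    }
    return false;
}
```

With 64 threads and a cutoff of 32, threads 0 to 31 form warp 0 and threads 32 to 63 form warp 1, so the helper reports that no warp is split: the second warp skips the branch as a whole rather than idling lane by lane. A cutoff of 16 would split warp 0.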
__global__ void calc(float *a, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = a[idx];
if (idx < n){
a[idx] = 4.0 /(1.0 + val*val);
extern "C" __declspec(dllexport) void GPU_Code ( f...
In Nvidia's compute prof there is a column called "static private mem per work group" and the tooltip of it says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes I assume) per block. Does that mean I am using somewhere between 1-64 of those bytes or is the profiler just telling me t...
I can't seem to find an answer to this simple question in the CUDA Programming Guide: when compiling a kernel with nvcc, what size integer is declared by short, int, long, and long long? Does it depend on my host architecture, so that I should use int16_t, int32_t, and int64_t, or is it always a fixed size?
Is there any way I can have a function inside a CUDA kernel? My kernel is getting pretty long and hard to debug. Thanks.
I'm trying to create a CUDA program that counts the number of true values (defined as non-zero values) in a long vector through a reduction algorithm. I'm getting funny results: I get either 0 or (ceil(N/threadsPerBlock)*threadsPerBlock), and neither is correct.
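One way to sidestep the integer-size question is to pin the widths at compile time. This is a minimal host-side sketch, on the assumption (my understanding, not confirmed by the question) that nvcc follows the host compiler's type sizes, so `long` in particular can differ between platforms. `bitsOf` is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Fixed-width types have the same size on every platform that provides them.
static_assert(sizeof(std::int16_t) * CHAR_BIT == 16, "int16_t must be 16 bits");
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t must be 32 bits");
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "int64_t must be 64 bits");

// Width in bits of a type under the current compiler.
template <typename T>
constexpr std::size_t bitsOf() { return sizeof(T) * CHAR_BIT; }
```

If a kernel relies on a particular width, using int16_t, int32_t, and int64_t (or adding a static_assert like the ones above) makes that dependency explicit instead of leaving it to the platform's definition of long.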
__global__ void count_reduce_logical(int * l, int * cntl, int N){
// sum...
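Since the kernel above is cut off, here is a CPU reference for what a block-wise counting reduction is expected to produce, which can be handy for checking GPU output. It is only a sketch that emulates the usual shared-memory tree reduction in host-side C++; `countNonZero` is a hypothetical name, and the power-of-two block size is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a block-wise tree reduction that counts non-zero entries.
// threadsPerBlock is assumed to be a power of two.
int countNonZero(const std::vector<int>& l, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0 && (threadsPerBlock & (threadsPerBlock - 1)) == 0);
    int total = 0;
    for (std::size_t base = 0; base < l.size(); base += threadsPerBlock) {
        // Stand-in for the block's shared-memory buffer.
        std::vector<int> shared(threadsPerBlock, 0);
        for (std::size_t t = 0; t < threadsPerBlock; ++t) {
            const std::size_t idx = base + t;
            // Threads past the end of the input contribute 0, and every
            // in-range value is normalized to 0 or 1 before summing.
            shared[t] = (idx < l.size() && l[idx] != 0) ? 1 : 0;
        }
        // Tree reduction: halve the number of active "threads" each step.
        for (std::size_t stride = threadsPerBlock / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                shared[t] += shared[t + stride];
            }
        }
        total += shared[0];  // one partial count per block
    }
    return total;
}
```

Two details this sketch is deliberate about are padding out-of-range threads with 0 and converting each value to 0 or 1 before the reduction. Whether either relates to the results reported above cannot be told from the truncated code.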
I went through the CUDA programming guide, but I can't understand the thread allocation method below.
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
KernelFunction<<<dimGrid, dimBlock>>>(. . .);
Can someone explain how threads are allocated in the above case?
How do I compile a CUDA app in Visual Studio 2010?
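The arithmetic behind the launch configuration above can be worked through on the host. This sketch uses a hypothetical `Dim3` struct as a stand-in for CUDA's dim3 so that it compiles without the toolkit; `volume`, `totalThreads`, and `threadInBlock` are illustrative names, not CUDA functions.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3.
struct Dim3 { unsigned x, y, z; };

// Number of elements a Dim3 describes: blocks in a grid, or threads in a block.
unsigned volume(Dim3 d) { return d.x * d.y * d.z; }

// Total threads launched by a grid of identical blocks.
unsigned totalThreads(Dim3 grid, Dim3 block) { return volume(grid) * volume(block); }

// Linear ID of thread (x, y, z) inside its block, with x varying fastest:
// x + y * Dx + z * Dx * Dy.
unsigned threadInBlock(Dim3 t, Dim3 blockDim) {
    return t.x + t.y * blockDim.x + t.z * blockDim.x * blockDim.y;
}
```

For dimGrid(2, 2, 1) and dimBlock(4, 2, 2) this gives 2 * 2 * 1 = 4 blocks, 4 * 2 * 2 = 16 threads per block, and 64 threads in total. The last thread of a block, (3, 1, 1), has linear ID 3 + 1 * 4 + 1 * 8 = 15.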
Here are my steps:
1. Create Empty C++ project without precompiled headers
2. Add main.cpp
int main()
return 0;
3. Add kernels.cu
I referred to the sample project matrixMul and copied its settings step by step. It compiles now.
#include "cuda.h"
__global__ void VecAdd(float*...
Hi everyone. I'm encountering a very strange problem: my 9800GT doesn't seem to calculate anything at all.
I've tried all the hello-world programs I could find on the internet; here's one of them.
This program creates an array of 1..100 on the host, sends it to the device, calculates the square of each value, returns it to the host, and prints the results.
#include "stdafx.h"
I'd like to know whether it's possible to program for CUDA without installing VS2008.
At the moment I've got VS2010 installed on my primary development machine and I don't want to mess things up by installing VS2008. Furthermore, I would have no use for it aside from CUDA.
I've been doing a few searches and it looks like it should be possib...
Say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand ...
If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?
Is it possible to create a linked list on a GPU using CUDA?
I am trying to do this and I am running into some difficulties.
If I can't allocate dynamic memory in a CUDA kernel, then how can I create a new node and add it to the linked list?
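One common workaround when nodes cannot be allocated on the fly is to reserve a fixed pool of nodes up front and link them by index rather than by pointer. The sketch below shows the idea in host-side C++. `Node`, `NodePool`, `pushFront`, and `toVector` are hypothetical names, and this is a single-threaded illustration, not a GPU implementation.

```cpp
#include <cassert>
#include <vector>

// A list node that links by index into a preallocated pool; -1 ends the list.
struct Node {
    int value;
    int next;
};

struct NodePool {
    std::vector<Node> nodes;  // on a GPU this would be one preallocated device array
    int used = 0;             // next free slot; on a GPU this would need an atomic increment
    int head = -1;            // index of the first node, -1 when the list is empty

    explicit NodePool(int capacity) : nodes(capacity) {}

    // Takes the next free slot and links it in front of the list.
    // Returns the slot index, or -1 if the pool is exhausted.
    int pushFront(int value) {
        if (used >= static_cast<int>(nodes.size())) return -1;
        const int idx = used++;
        nodes[idx] = Node{value, head};
        head = idx;
        return idx;
    }
};

// Walks the list from head and collects the values in list order.
std::vector<int> toVector(const NodePool& pool) {
    std::vector<int> out;
    for (int i = pool.head; i != -1; i = pool.nodes[i].next) {
        out.push_back(pool.nodes[i].value);
    }
    return out;
}
```

Index links stay valid if the whole array is copied between host and device, which raw pointers would not. In a real kernel, concurrent insertions would also need atomic updates of both the free-slot counter and the head, which this sketch leaves out.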
Hi everybody,
I've been working on an AES CUDA application and I have a kernel which performs ECB encryption on the GPU. To ensure the logic of the algorithm is not modified when running in parallel, I send a known input test vector provided by NIST and then, from host code, compare the output with the known test vector output prov...
I used the CL_MEM_ALLOC_HOST_PTR flag with my clCreateBuffer calls, but the Compute Profiler shows all my "host mem transfer type" as being Pageable. I tried it in two different kernel setups, but the profiler wouldn't show that I was using pinned memory.
Is it just really random when a kernel gets to use pinned memory? Is it constraine...