Hey all,
I heard that with the Nvidia compute profiler it should be possible to get a comparison of how much time is being spent on arithmetic ops, memory ops, and latency. I searched through the profiler's output after running my program and I tried googling, but I don't see anything related to figuring out these metrics.
Can anybody help? Is my qu...
I have been reading the programming guides for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. I tried googling for "bank conflict" and "bank conflict computer science" but I couldn't find much. Can anybody help me understand or p...
One thing I haven't figured out, and Google isn't helping me with, is why it is possible to have bank conflicts with shared memory but not with global memory. Can there be bank conflicts with registers?
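For illustration, a minimal CUDA sketch of what a conflict looks like (kernel names and sizes are mine, not taken from the question): shared memory is split into banks of 4-byte words (16 banks on compute capability 1.x devices, 32 on later parts), and a conflict happens when threads of the same half-warp or warp address different words that map to the same bank.

__global__ void conflictFree(float *out)
{
    __shared__ float tile[256];
    int tid = threadIdx.x;

    // Stride-1: consecutive threads hit consecutive words, hence different
    // banks, so the access completes in one shared-memory transaction.
    tile[tid] = (float)tid;
    __syncthreads();
    out[tid] = tile[tid];
}

__global__ void twoWayConflict(float *out)
{
    __shared__ float tile[512];
    int tid = threadIdx.x;

    // Stride-2: thread 0 and thread 16 (or 32) land in the same bank, so the
    // hardware serializes the access into two transactions (a 2-way conflict).
    tile[2 * tid] = (float)tid;
    __syncthreads();
    out[tid] = tile[2 * tid];
}

On the follow-up: global memory has no programmer-visible banks, its access cost is governed by coalescing rules instead, and registers are private to a thread, which is why the guides only discuss bank conflicts for shared (OpenCL: local) memory.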
UPDATE
Wow, I really appreciate the two answers from Tibbit and Grizzly. It seems that I can only give a green check mark to one answer, though...
Hey all,
I am using Compute Prof 3.2 and a GeForce GTX 280, so I believe I have compute capability 1.3.
This file, http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/visual_profiler_cuda/CUDA_Profiler_3.0.txt, seems to show that I should be able to see these fields since I am using a 1.x compute device. Well I don't s...
const char programSource[] =
    "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
    "{"
    "    int gid = get_global_id(0);"
    "    for (int i = 0; i < 10; i++) {"
    "        a[gid] = b[gid] + c[gid];"
    "    }"
    "}";
The kernel above performs the vector addition ten times in a loop. I have used the prog...
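For context, a rough host-side sketch of how a source string like programSource is typically built and launched (error checking trimmed for brevity; the size N and all variable names are my own illustration, and programSource is assumed to be the declaration quoted above):

#include <stdio.h>
#include <CL/cl.h>

#define N 1024   /* illustrative problem size */

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Build the kernel from the source string quoted in the question. */
    const char *src = programSource;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vecAdd", &err);

    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    cl_mem da = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(a), NULL, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(c), c, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, da, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);

    printf("a[42] = %d (expected %d)\n", a[42], b[42] + c[42]);
    return 0;
}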
On Snow Leopard, with a MacBook Pro that has two graphics devices, the following warning is printed to stderr multiple times a second:
Wed Oct 6 02:35:27 nausicaa.local TestApp[92464] <Warning>:
CGDisplayIsCaptured: Fixing up display ID 0x4272ec2 for offline
mux head to 0x4272ec0
When I force the graphics device to be either Nvidia or In...
Hey all,
I have been having a tough time setting up an experiment where I allocate memory on the device with CUDA, take that pointer to device memory, use it in OpenCL, and return the results. I want to see if this is possible. I had trouble getting a CUDA project to work, so I just used Nvidia's template project from their SDK...
In Nvidia's Compute Prof there is a column called "static private mem per work group", and its tooltip says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes, I assume) per block. Does that mean I am using somewhere between 1 and 64 of those bytes, or is the profiler just telling me t...
I don't know whether this is the right forum. Anyway, here is the question. In one of our applications we display medical images and, on top of them, an algorithm-generated bitmap. The real bitmap is a 16-bit grayscale bitmap. From this we generate a color bitmap based on a look-up table, for example:
(0-100) -> green
(100-200) -> blue
(200 and above)...
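A minimal sketch of that kind of threshold-based look-up (the ranges and the first two colors are the ones given above; the color for 200 and above was cut off, so red is only an assumption, and the function and type names are mine):

#include <stdint.h>

typedef struct { uint8_t r, g, b; } RGB;

/* Map a 16-bit grey value to a color using the example ranges above.
   A real application would use its full look-up table instead. */
static RGB grey_to_color(uint16_t v)
{
    RGB c = {0, 0, 0};
    if (v < 100)      { c.g = 255; }   /* (0-100)   -> green */
    else if (v < 200) { c.b = 255; }   /* (100-200) -> blue  */
    else              { c.r = 255; }   /* (200 and above) -> red (assumed) */
    return c;
}

/* Convert a whole 16-bit grayscale bitmap into an RGB bitmap. */
void colorize(const uint16_t *grey, RGB *out, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        out[i] = grey_to_color(grey[i]);
}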
I want to use the GPU for counting purposes. I need it to fall back to the CPU if no GPU is found, and to provide me with a unified API (I'm interested in any .NET option, for example .NET 4).
...
If my algorithm is bottlenecked by host-to-device and device-to-host memory transfers, is the only solution a different or revised algorithm?
...
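For reference, a hedged sketch of the usual alternative to reworking the algorithm: pinned host memory plus streams, so the transfer of one chunk overlaps with computation on another. The process kernel, the sizes, and the four-way split are my own illustration, and the benefit depends on the device actually supporting copy/compute overlap.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   /* placeholder work */
}

int main(void)
{
    const int N = 1 << 20, CHUNK = N / 4;
    float *h, *d;
    cudaMallocHost((void**)&h, N * sizeof(float));  /* pinned: faster copies, allows async */
    cudaMalloc((void**)&d, N * sizeof(float));

    cudaStream_t s[4];
    for (int i = 0; i < 4; i++) cudaStreamCreate(&s[i]);

    /* Split the work so copy-in, kernel, and copy-out of different chunks overlap. */
    for (int i = 0; i < 4; i++) {
        int off = i * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}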
Hi everybody,
I've been working on an AES CUDA application, and I have a kernel which performs ECB encryption on the GPU. To ensure the logic of the algorithm is not modified when running in parallel, I send a known input test vector provided by NIST and then, from host code, compare the output with the known test vector output prov...
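The host-side comparison described there boils down to something like the following sketch (the function, buffer, and vector names are mine, and the actual NIST test vectors are not reproduced here):

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Compare the GPU-encrypted block against the expected NIST ciphertext.
   `expected` would hold the known-answer test vector; it is not filled in here. */
int check_output(const unsigned char *d_ciphertext, size_t len,
                 const unsigned char *expected)
{
    unsigned char host_out[64];                  /* enough for one test block */
    if (len > sizeof(host_out)) return 1;

    cudaMemcpy(host_out, d_ciphertext, len, cudaMemcpyDeviceToHost);

    if (memcmp(host_out, expected, len) == 0) {
        printf("ECB known-answer test passed\n");
        return 0;
    }
    printf("ECB known-answer test FAILED\n");
    return 1;
}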
The vector addition example has this code:
// Asynchronous write of data to GPU device
ciErr1 = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcA, 0, NULL, NULL);
ciErr1 |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcB, 0, NULL,...
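As a side note on that snippet: because CL_FALSE makes these writes non-blocking, srcA and srcB must stay valid and unmodified until the writes complete. A hedged sketch of one way to wait for that, reusing the variable names from the snippet above (alternatively, CL_TRUE makes the write blocking in the first place):

cl_event writeDone;

/* Same call as above, but request an event so the host can wait for
   completion before srcA is modified or freed. */
ciErr1 = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0,
                              sizeof(cl_float) * szGlobalWorkSize, srcA,
                              0, NULL, &writeDone);
clWaitForEvents(1, &writeDone);
clReleaseEvent(writeDone);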
I have a compute capability 1.2 card. It reports gld_efficiency and gst_efficiency for me. My problem is that I sometimes get values outside the 0-1 range, sometimes greater than 2. Page 57 of the User Guide for the Compute Visual Profiler states that they should be between 0 and 1, so I am confused. Can anybody explain?
...
It seems like 2 million floats should be no big deal, only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much at times, and sometimes more than that, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like thi...
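One way to narrow this down (a hedged sketch, not a definitive diagnosis): check the return code of every enqueue and force completion with clFinish right after the kernel, so a failure inside the kernel is reported there rather than at the later read. The names queue, kernel, devBuf, hostBuf, and globalSize are placeholders for the application's existing objects.

#include <stdio.h>
#include <CL/cl.h>

void run_and_check(cl_command_queue queue, cl_kernel kernel,
                   cl_mem devBuf, cl_float *hostBuf, size_t globalSize)
{
    cl_int err;

    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS) printf("enqueue kernel failed: %d\n", err);

    /* Block until the kernel has actually run; an out-of-bounds access or
       resource problem in the kernel tends to surface at this point rather
       than at the following transfer. */
    err = clFinish(queue);
    if (err != CL_SUCCESS) printf("clFinish after kernel failed: %d\n", err);

    err = clEnqueueReadBuffer(queue, devBuf, CL_TRUE, 0,
                              2000000 * sizeof(cl_float), hostBuf,
                              0, NULL, NULL);
    if (err != CL_SUCCESS) printf("read buffer failed: %d\n", err);
}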
I would like to see an example of rendering with nVidia Cg to an offscreen frame buffer object.
The computers I have access to have graphics cards but no monitors (or X server). So I want to render my stuff and output it as images on disk. The graphics cards are GTX 285s.
...
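Not Cg-specific, but the offscreen part usually looks like the sketch below using the GL_EXT_framebuffer_object extension. Creating a GL context with no running X server is a separate problem and is only stubbed out here: createHeadlessContext is a placeholder I made up, not a real API, and the "rendering" is just a clear where the Cg-driven draw calls would go.

#include <stdio.h>
#include <stdlib.h>
#include <GL/glew.h>

extern void createHeadlessContext(void);   /* placeholder: platform-specific context setup */

int main(void)
{
    const int W = 512, H = 512;
    createHeadlessContext();
    glewInit();

    /* Create an FBO with a single RGBA renderbuffer as its color attachment. */
    GLuint fbo, rb;
    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glGenRenderbuffersEXT(1, &rb);
    glBindRenderbufferEXT(GL_RENDERBUFFER_EXT, rb);
    glRenderbufferStorageEXT(GL_RENDERBUFFER_EXT, GL_RGBA8, W, H);
    glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                 GL_RENDERBUFFER_EXT, rb);
    if (glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT) != GL_FRAMEBUFFER_COMPLETE_EXT) {
        fprintf(stderr, "FBO incomplete\n");
        return 1;
    }

    /* Bind Cg programs and draw here; this sketch just clears to a color. */
    glViewport(0, 0, W, H);
    glClearColor(0.2f, 0.4f, 0.6f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);

    /* Read the result back and dump it as a binary PPM image. */
    unsigned char *pixels = malloc(W * H * 3);
    glReadPixels(0, 0, W, H, GL_RGB, GL_UNSIGNED_BYTE, pixels);

    FILE *f = fopen("out.ppm", "wb");
    fprintf(f, "P6\n%d %d\n255\n", W, H);
    fwrite(pixels, 1, W * H * 3, f);
    fclose(f);
    free(pixels);
    return 0;
}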
I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The following code works fine with values of "Width" (the matrix width) up to about 2500 or so.
int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err ...
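A hedged sketch of how the failure point is usually isolated in setups like this: check every allocation and, after the kernel launch, both cudaGetLastError and a synchronizing call. Width, Md, Nd, Pd, and size follow the snippet above; the CHECK macro, MatrixMulKernel, and the launch configuration dimGrid/dimBlock are placeholders of mine (requires <stdio.h>, <stdlib.h>, and cuda_runtime.h).

#define CHECK(call)                                                           \
    do {                                                                      \
        cudaError_t e = (call);                                               \
        if (e != cudaSuccess) {                                               \
            printf("%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
            exit(1);                                                          \
        }                                                                     \
    } while (0)

/* ... inside the host function, after the declarations in the snippet above: */
CHECK(cudaMalloc((void**)&Md, size));
CHECK(cudaMalloc((void**)&Nd, size));
CHECK(cudaMalloc((void**)&Pd, size));

/* Placeholder launch: the real kernel and grid/block dimensions come from the application. */
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
CHECK(cudaGetLastError());          /* catches an invalid launch configuration */
CHECK(cudaThreadSynchronize());     /* catches errors raised while the kernel runs */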